WORLD INTELLECTUAL PROPERTY ORGANIZATION

International Bureau 




Reference I of 20 
with Response dated 05/04/04 
In USSN: 09/857,826 



PCT

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification:

C12Q 1/68 



A1



(11) International Publication Number: WO 95/21944

(43) International Publication Date: 17 August 1995 (17.08.95) 



(21) International Application Number: PCT/US95/01863 

(22) International Filing Date: 14 February 1995 (14.02.95)



(30) Priority Data:
08/195,485    14 February 1994 (14.02.94)    US



(60) Parent Application or Grant
(63) Related by Continuation
US    08/195,485 (CIP)
Filed on    14 February 1994 (14.02.94)



(71) Applicant (for all designated States except US): SMITHKLINE
BEECHAM CORPORATION [US/US]; Corporate Intellectual
Property, UW2220, 709 Swedeland Road, P.O. Box
1539, King of Prussia, PA 19406-0939 (US).

(72) Inventors; and 

(75) Inventors/Applicants (for US only): ROSENBERG, Martin
[US/US]; 241 Mingo Road, Royersford, PA 19468 (US).
DEBOUCK, Christine [BE/US]; 667 Pugh Road, Wayne,
PA 19087 (US). BERGSMA, Derk [US/US]; 271 Irish Road,
Berwyn, PA 19312 (US).



(74) Agents: JERVIS, Herbert, H. et al.; SmithKline Beecham
Corporation, Corporate Intellectual Property, UW2220, 709
Swedeland Road, P.O. Box 1539, King of Prussia, PA
19406-0939 (US).



(81) Designated States: JP, US, European patent (AT, BE, CH, DE,
DK, ES, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE).



Published 

With international search report. 



(54) Title: DIFFERENTIALLY EXPRESSED GENES IN HEALTHY AND DISEASED SUBJECTS
(57) Abstract 

The present invention involves methods and compositions for identifying genes which are differentially expressed in a normal healthy 
animal and an animal having a selected disease or infection, and methods for diagnosing diseases or infections characterized by the presence 
of those genes, despite the absence of knowledge about the gene or its function. The methods involve the use of a composition suitable 
for use in hybridization which consists of a solid surface on which is immobilized at pre-defined regions thereon a plurality of defined 
oligonucleotide/polynucleotide sequences for hybridization. Each sequence comprises a fragment of an EST isolated from an identified
DNA library prepared from tissue or cell samples of a healthy animal, an animal with a selected disease or infection, and any combination 
thereof. Differences in hybridization patterns produced through use of this composition and the specified methods enable diagnosis of 
disease based on differential expression of genes of unknown function, and enable the identification of those genes and the proteins encoded 
thereby. 



FOR THE PURPOSES OF INFORMATION ONLY 



Codes used to identify States party to the PCT on the front pages of pamphlets publishing international 
applications under the PCT. 



AT  Austria                     GB  United Kingdom                           MR  Mauritania
AU  Australia                   GE  Georgia                                  MW  Malawi
BB  Barbados                    GN  Guinea                                   NE  Niger
BE  Belgium                     GR  Greece                                   NL  Netherlands
BF  Burkina Faso                HU  Hungary                                  NO  Norway
BG  Bulgaria                    IE  Ireland                                  NZ  New Zealand
BJ  Benin                       IT  Italy                                    PL  Poland
BR  Brazil                      JP  Japan                                    PT  Portugal
BY  Belarus                     KE  Kenya                                    RO  Romania
CA  Canada                      KG  Kyrgystan                                RU  Russian Federation
CF  Central African Republic    KP  Democratic People's Republic of Korea    SD  Sudan
CG  Congo                                                                    SE  Sweden
CH  Switzerland                 KR  Republic of Korea                        SI  Slovenia
CI  Côte d'Ivoire               KZ  Kazakhstan                               SK  Slovakia
CM  Cameroon                    LI  Liechtenstein                            SN  Senegal
CN  China                       LK  Sri Lanka                                TD  Chad
CS  Czechoslovakia              LU  Luxembourg                               TG  Togo
CZ  Czech Republic              LV  Latvia                                   TJ  Tajikistan
DE  Germany                     MC  Monaco                                   TT  Trinidad and Tobago
DK  Denmark                     MD  Republic of Moldova                      UA  Ukraine
ES  Spain                       MG  Madagascar                               US  United States of America
FI  Finland                     ML  Mali                                     UZ  Uzbekistan
FR  France                      MN  Mongolia                                 VN  Viet Nam
GA  Gabon











WO 95/21944    PCT/US95/01863



DIFFERENTIALLY EXPRESSED GENES IN HEALTHY AND DISEASED SUBJECTS

Cross Reference to Related Applications: 
This application is a continuation-in-part application of U.S. Serial No.
08/195,485 filed February 14, 1994, the contents of which are incorporated herein by 
reference. 

Field of the Invention 

The present invention relates to the use of immobilized
oligonucleotide/polynucleotide or polynucleotide sequences for the identification, 
sequencing and characterization of genes which are implicated in disease, infection, 
or development and the use of such identified genes and the proteins encoded thereby 
in diagnosis, prognosis, therapy and drug discovery. 

Background of the Invention 

Identification, sequencing and characterization of genes, especially
human genes, is a major goal of modern scientific research. By identifying genes,
determining their sequences and characterizing their biological function, it is possible
to employ recombinant DNA technology to produce large quantities of valuable "gene
products", e.g., proteins and peptides. Additionally, knowledge of gene sequences
can provide a key to diagnosis, prognosis and treatment of a variety of disease states
in plants and animals which are characterized by inappropriate expression and/or
repression of selected gene(s) or by the influence of external factors, e.g., carcinogens
or teratogens, on gene function. The term "disease-associated gene(s)" is used herein
in its broadest sense to mean not only genes associated with classical inherited
diseases, but also those associated with genetic predisposition to disease as well as
infectious or pathogenic states resulting from gene expression by infectious agents or
the effect on host cell gene expression by the presence of such a pathogen or its
products. Locating disease-associated genes will permit the development of
diagnostic and prognostic reagents and methods, as well as possible therapeutic
regimens, and the discovery of new drugs for treating or preventing the occurrence of
such diseases.

Methods have been described for the identification of certain novel
gene sequences, referred to as Expressed Sequence Tags (ESTs) [see, e.g., Adams et
al., Science, 252:1651-1656 (1991); and International Patent Application No.
WO93/00353, published January 7, 1993]. Conventionally, an EST is a specific cDNA
polynucleotide sequence, or tag, about 150 to 400 nucleotides in length, derived from
a messenger RNA molecule by reverse transcription, which is a marker for, and
component of, a human gene actually transcribed in vivo. However, as used herein an
EST also refers to a genomic DNA fragment derived from an organism, such as a
microorganism, the DNA of which lacks intron regions.
A variety of techniques have been described for identifying particular
gene sequences on the basis of their gene products. For example, several techniques
are described in the art [see, e.g., International Patent Application No. WO91/07087,
published May 30, 1991]. Additionally, known methods exist for the amplification of
desired sequences [see, e.g., International Patent Application No. WO91/17271,
published November 14, 1991, among others].

However, at present, there exist no established methods for filling the
need in the art for methods and reagents which employ fragments of differentially
expressed genes of known, unknown (or previously unrecognized) function or
consequence to provide diagnostic and therapeutic methods and reagents for diagnosis
and treatment of disease or infection, which conditions are characterized by such
genes and gene products. It should be appreciated that it is the expression differences
that are diagnostic of the altered state (e.g., predisease, disease, pathogenic,
progression or infectious). Such genes associated with the altered state are likely to
be the targets of drug discovery. Whether the genes are the cause or the effect of the
condition, identification of such genes provides insight into which gene expression
needs to be re-altered in order to reestablish the healthy state.

Summary of the Invention 

In one aspect, the invention provides methods for identifying gene(s)
which are differentially expressed, for example, in a normal healthy organism and an
organism having a disease. The method involves producing and comparing
hybridization patterns formed between samples of expressed mRNA or cDNA
polynucleotide sequences obtained from either analogous cells, tissues or organs of a
healthy organism and a diseased organism and a defined set of
oligonucleotide/polynucleotide sequence probes from either a
healthy organism or a diseased organism immobilized on a support. Those defined
oligonucleotide/polynucleotide sequences are representative of the total expressed
genetic component of the cells, tissues, organs or organism as defined by the collection
of partial cDNA sequences (ESTs). The differences between the hybridization
patterns permit identification of those particular EST or gene-specific
oligonucleotide/polynucleotide sequences associated with differential expression, and
the identification of the EST permits identification of the clone from which it was
derived and, using ordinary skill, further cloning and, if desired, sequencing of the
full-length cDNA and genomic counterpart, i.e., gene, from which it was obtained.

In another aspect, the invention provides methods substantially similar 
to those described above, but which permit identification of those gene(s) of a 
pathogen which are expressed in any biological sample of an infected organism based
on comparative hybridization of RNA/cDNA samples derived from a healthy versus 
infected organism, hybridized to an oligonucleotide/polynucleotide set representative 
of the gene coding complement of the pathogen of interest. 

In another aspect, the invention provides methods substantially similar
to those described above, but which permit identification of those EST-specific
oligonucleotide/polynucleotide sequences of host gene(s) which represent genes being
differentially expressed or altered in expression by the disease state or infection and are
expressed in any biological sample of an infected organism based on comparative
hybridization of RNA/cDNA samples derived from a healthy versus infected
organism of interest.

In a further aspect, the methods described above and in detail below
also provide methods for diagnosis of diseases or infections characterized by
differentially expressed genes, the expression of which has been altered as a result of
infection by the pathogen or disease causing agent in question. All identified
differences provide the basis for diagnostic testing, be it the altered expression of
endogenous genes or the patterned expression of the genes of the infecting organism.
Such patterns of altered expression are defined by comparing RNA/cDNA from the
two states hybridized against a panel of oligonucleotide/polynucleotides representing
the expressed gene component of a cell, tissue, organ or organism as defined by its
collection of ESTs.

Yet a further aspect of this invention provides a composition suitable 
for use in hybridization, which comprises a solid surface on which is immobilized at 
pre-defined regions thereon a plurality of defined oligonucleotide/polynucleotide 
sequences for hybridization, each sequence comprising a fragment of an EST isolated 
from a cDNA or DNA library prepared from at least one selected tissue or cell
sample of a healthy (i.e., pre-disease state) animal, at least one analogous sample of 
an animal having a disease, at least one analogous sample of an animal infected with a 
pathogen or the pathogen itself, or any combination or multiple combinations thereof. 

An additional aspect of the invention provides an isolated gene 
sequence which is differentially expressed in a normal healthy animal and an animal
having a disease, and is identified by the methods above. Similarly, an isolated 
pathogen gene sequence which is expressed in tissue or cell samples of an infected 
animal can be identified by the methods above. 

Yet another aspect of the invention is that it provides not only a means
for a static diagnostic but also a means for carrying out the procedure over
time to measure disease progression as well as monitoring the efficacy of disease
treatment regimes, including any toxicological effects thereof.

Another aspect of the invention is an isolated protein produced by
expression of the gene sequences identified above. Such proteins are useful in
therapeutic compositions or diagnostic compositions, or as targets for drug
development.

Other aspects and advantages of the present invention are described 
further in the following detailed description of the preferred embodiments thereof.

Detailed Description of the Invention

The present invention meets the unfulfilled needs in the art by 
providing methods for the identification and use of gene fragments and genes, even 
those of unknown full length sequence and unknown function, which are
differentially expressed in a healthy animal and in an animal having a specific disease 
or infection by use of ESTs derived from DNA libraries of healthy and/or 
diseased/infected animals. Employing the methods of this invention permits the 
resulting identification and isolation of such genes by using their corresponding ESTs 
and thereby also permits the production of protein products encoded by such genes.
The genes themselves and/or protein products, if desired, may be employed in the 
diagnosis or therapy of the disease or infection with which the genes are associated 
and in the development of new drugs therefor. 

It has been appreciated that one or more differentially identified EST
or gene-specific oligonucleotide/polynucleotides define a pattern of differentially
expressed genes diagnostic of a predisease, disease or infective state. A knowledge of
the specific biological function of the EST is not required, only that the EST
identifies a gene or genes whose altered expression is associated reproducibly with
the predisease, disease or infectious state. The differences permit the identification of
gene products altered in their expression by the disease and represent those products
most likely to be targets of therapeutic intervention. Similarly, the product may be of
the infecting organism itself and also be an effective target of intervention.

I. Definitions

Several words and phrases used throughout this specification are
defined as follows:

As used herein, the term "gene" refers to the genomic nucleotide
sequence from which a cDNA sequence is derived, which cDNA produces an EST, as
described below. The term gene classically refers to the genomic sequence, which,
upon processing, can produce different cDNAs, e.g., by splicing events. However,
for ease of reading, any full-length counterpart cDNA sequence which gives rise to an
EST will also be referred to by shorthand herein as a 'gene'.
The term "organism" includes, without limitation, microbes, plants and
animals.

The term "animal" is used in its broadest sense to include all members
of the animal kingdom, including humans. It should be understood, however, that
according to this invention the same species of animal which provides the biological
sample also is the source of the defined immobilized oligonucleotide/polynucleotides
as defined below.

The term "pathogen" is defined herein as any molecule or organism
which is capable of infecting an animal or plant and replicating its nucleic acid
sequences in the cells or tissues of that animal or plant. Such a pathogen is generally
associated with a disease condition in the infected animal or plant. Such pathogens
may include viruses, which replicate intra- or extra-cellularly, or other organisms,
such as bacteria, fungi or parasites, which generally infect tissues or the blood.
Certain pathogens or microorganisms are known to exist in sequential and
distinguishable stages of development, e.g., latent stages, infective stages, and stages
which cause symptomatic diseases. In these different stages, the pathogens are
anticipated to express differentially certain genes and/or turn on or off host cell gene
expression.

As used herein, the term "disease" or "disease state" refers to any
condition which deviates from a normal or standardized healthy state in an organism
of the same species in terms of differential expression of the organism's genes. In
other words, a disease state can be any illness or disorder, be it of genetic or
environmental origin, for example, an inherited disorder such as certain breast
cancers, or a disorder which is characterized by expression of gene(s) normally in an
inactive, 'turned off' state in a healthy animal, or a disorder which is characterized by
under-expression or no expression of gene(s) which is normally activated or 'turned
on' in a normal healthy animal. Such differential expression of genes may also be
detected in a condition caused by infection, inflammation, or allergy, a condition
caused by development or aging of the animal, a condition caused by administration
of a drug or exposure of the animal to another agent, e.g., nutrition, which affects
gene expression. Essentially, the methods described herein can be adapted to detect
differential gene expression resulting from any cause, by manipulation of the defined
oligonucleotide/polynucleotides and the samples tested as described below. The
concept of disease or disease state also includes its temporal aspects in terms of
progression and treatment.

The phrase "differentially expressed" refers to those situations in
which a gene transcript is found in differing numbers of copies, or in activated vs.
inactivated states, in different cell types or tissue types of an organism having a
selected disease as contrasted to the levels of the gene transcript found in the same
cells or tissues of a healthy organism. Genes may be differentially expressed in
differing states of activation in microorganisms or pathogens in different stages of
development. For example, multiple copies of gene transcripts may be found in an
organism having a selected disease, while only one, or significantly fewer copies, of
the same gene transcript are found in a healthy organism, or vice-versa.
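
By way of illustration only, the comparison of transcript levels described in this definition can be sketched as a fold-change test on per-probe signal values. The probe names, the intensity values, the two-fold cutoff and the background floor are all assumptions made for this sketch; the specification does not prescribe a particular threshold or scoring scheme.

```python
from math import log2


def differential_probes(healthy, diseased, min_fold=2.0, floor=1.0):
    """Return {probe_id: log2(diseased / healthy)} for probes whose signal
    differs by at least `min_fold` in either direction.

    `healthy` and `diseased` map a probe identifier to a hybridization
    signal. `floor` (an assumed background level) avoids division by zero
    when a transcript is absent from one sample.
    """
    cutoff = log2(min_fold)
    flagged = {}
    for probe_id in healthy.keys() & diseased.keys():
        h = max(healthy[probe_id], floor)
        d = max(diseased[probe_id], floor)
        ratio = log2(d / h)
        if abs(ratio) >= cutoff:
            flagged[probe_id] = ratio
    return flagged


# Hypothetical signals for three immobilized EST fragments.
healthy = {"EST_A": 100.0, "EST_B": 50.0, "EST_C": 0.0}
diseased = {"EST_A": 110.0, "EST_B": 400.0, "EST_C": 30.0}
result = differential_probes(healthy, diseased)
```

In this made-up data, EST_A is essentially unchanged and is not flagged, EST_B is over-represented in the diseased sample, and EST_C is detected only in the diseased sample. A positive value indicates more transcript in the diseased sample; a negative value indicates less.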

As used herein, the term "solid support" refers to any known substrate
which is useful for the immobilization of large numbers of
oligonucleotide/polynucleotide sequences by any available method to enable
detectable hybridization of the immobilized oligonucleotide/polynucleotide sequences
with other polynucleotide sequences in a sample. Among a number of available solid
supports, one desirable example is the supports described in International Patent
Application No. WO91/07087, published May 30, 1991. Also useful are supports such
as, but not limited to, nitrocellulose, nylon, glass, silica and Pall Biodyne C®. It is
also anticipated that improvements yet to be made to conventional solid supports may
also be employed in this invention.

The term "surface" means any generally two-dimensional structure on
a solid support to which the desired oligonucleotide/polynucleotide sequence is
attached or immobilized. A surface may have steps, ridges, kinks, terraces and the
like.

As used herein, the term "predefined region" refers to a localized area
on a surface of a solid support on which is immobilized one or multiple copies of a
particular oligonucleotide/polynucleotide sequence and which enables the
identification of the oligonucleotide/polynucleotide at the position, if hybridization of
that oligonucleotide/polynucleotide to a sample polynucleotide occurs.
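
As a purely illustrative sketch of how a predefined region identifies the sequence immobilized at it, the layout of a surface can be modelled as a lookup from position to probe identifier. The row and column addressing scheme, the `Region` type and the probe names are assumptions of this sketch and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A predefined region, addressed here by an assumed (row, col) grid."""
    row: int
    col: int


def build_layout(probe_ids, n_cols):
    """Assign each probe to the next free grid position, row by row."""
    return {
        Region(i // n_cols, i % n_cols): probe_id
        for i, probe_id in enumerate(probe_ids)
    }


def identify(layout, hybridized_regions):
    """Translate the positions at which hybridization was detected back
    into the identities of the immobilized sequences."""
    return [layout[r] for r in hybridized_regions if r in layout]


# Hypothetical fragments of three ESTs laid out on a two-column surface.
layout = build_layout(
    ["EST1_f1", "EST1_f2", "EST2_f1", "EST2_f2", "EST3_f1"], n_cols=2
)
hits = identify(layout, [Region(2, 0), Region(0, 1)])
```

Because every position is assigned in advance, a detected signal at a given position is enough to name the fragment, and hence the EST, that hybridized.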

"Immobilized" refers to the attachment of the
oligonucleotide/polynucleotide to the solid support. Means of immobilization are
known and conventional to those of skill in the art, and may depend on the type of
support being used.

By "EST" or "Expressed Sequence Tag" is meant a partial DNA or
cDNA sequence of about 150 to 500, more preferably about 300, sequential
nucleotides of a longer sequence obtained from a genomic or cDNA library prepared
from a selected cell, cell type, tissue or tissue type, organ or organism which longer
sequence corresponds to an mRNA of a gene found in that library. An EST is
generally DNA. One or more libraries made from a single tissue type typically
provide at least about 3000 different (i.e., unique) ESTs and potentially the full
complement of all possible ESTs representing all cDNAs, e.g., 50,000-100,000 in an
animal such as a human. Further background and information on the construction of
ESTs is described in M. D. Adams et al., Science, 252:1651-1656 (1991); and
International Application Number PCT/US92/0522[illegible] (published January 7, 1993).

As used herein, the term "defined oligonucleotide/polynucleotide
sequence" refers to a known nucleotide sequence fragment of a selected EST or gene.
This term is used interchangeably with the term "fragments of EST". These
sequential sequences are generally comprised of between about 15 to about 45
nucleotides and more preferably between about 20 to about 25 nucleotides in length.
Thus any single EST of 300 nucleotides in length may provide about 280 different
defined oligonucleotide/polynucleotide sequences of 20 nucleotides in length (e.g.,
20-mers). The lengths of the defined oligonucleotide/polynucleotides may be readily
increased or decreased as desired or needed, depending on the limitations of the solid
support on which they may be immobilized or the requirements of the hybridization
conditions to be employed. The length is generally guided by the principle that it
should be of sufficient length to ensure that it is, on average, represented only once in
the population to be examined. Generally, these defined
oligonucleotide/polynucleotides are RNA or DNA and are preferably derived from
the anti-sense strand of the EST sequence or from a corresponding mRNA sequence
to enable their hybridization with samples of RNA or DNA. Modified nucleotides
may be incorporated to increase stability and hybridization properties.
By the term "plurality of defined oligonucleotide/polynucleotide
sequences" is meant the following. A surface of a solid support may immobilize a
large number of "defined oligonucleotide/polynucleotides". For example, depending
upon the nature of the surface, it can immobilize from about 300 to upwards of
60,000 defined 20-mer oligonucleotide/polynucleotides. It is anticipated that future
improvements to solid surfaces will permit considerably larger such pluralities to be
immobilized on a single surface. A "plurality" of sequences refers to the use on any
one solid support of multiple different defined oligonucleotide/polynucleotides from a
single EST from a selected library, as well as multiple different defined
oligonucleotide/polynucleotides from different ESTs from the same library or many
libraries from the same or different tissues, and may also include multiple identical
copies of defined oligonucleotide/polynucleotides. Ultimately a plurality has at least
one oligonucleotide/polynucleotide per expressed gene in the entire organism. For
example, from a library producing about 5,000-10,000 ESTs, a single support can
include at least about 1-20 defined oligonucleotide/polynucleotides representing every
EST in that library. The composition of defined oligonucleotide/polynucleotides
which make up a surface according to this invention may be selected or designed as
desired.
The term "sample" is employed in the description of this invention in
several important ways. As used herein, the term "sample" encompasses any cell or
tissue from an organism. Any desired cell or tissue type in any desired state may be
selected to form a sample. For example, the sample cell desired may be a human T
cell; the desired cell type for use in this invention may be a quiescent T cell or an
activated T cell.

By the phrase "analogous sample" or "analogous cell or tissue" is 
meant that according to this invention when the ESTs which provide the defined 
oligonucleotide/polynucleotides are produced from a cDNA library prepared from a 
single tissue or cell type source sample, e.g., liver tissue of a human, then the samples
used to hybridize to those immobilized defined oligonucleotide/polynucleotides are
preferably provided by the same type of sample from either a healthy or diseased 
animal, i.e., liver tissue of a healthy human and liver tissue of a diseased or infected 
human or from a human suspected of having that disease or infection. Alternatively, 
if the surface contains defined oligonucleotide/polynucleotides from multiple cells or
tissues, then the "samples" which are hybridized thereto can be but are not limited to
samples obtained from analogous multiple tissues or cells. 

The term "detectably hybridizing" means that the sample from the
healthy organism or diseased or infected organism is contacted with the defined
oligonucleotide/polynucleotides on the surface for sufficient time to permit the
formation of patterns of hybridization on the surfaces caused by hybridization
between certain polynucleotide sequences in the samples with the certain immobilized
defined oligonucleotide/polynucleotides. These patterns are made detectable by the
use of available conventional techniques, such as fluorescent labelling of the samples.
Preferably hybridization takes place under stringent conditions, e.g., revealing
homologies of about 95%. However, if desired, other less stringent conditions may
be selected. Techniques and conditions for hybridization at selected stringencies are
well known in the art [see, e.g., Sambrook et al., Molecular Cloning: A Laboratory
Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY (1989)].
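
As a rough numerical illustration of what a homology of about 95% means for a 20-mer, the sketch below counts matching positions between a probe and an equal-length target site written on the same strand. This is only a crude proxy adopted for the example: actual stringency is set experimentally (for instance by temperature and salt concentration), and the function names and the 95% default are assumptions of this sketch.

```python
def percent_identity(probe, target):
    """Percentage of positions at which two equal-length sequences match."""
    if len(probe) != len(target):
        raise ValueError("sequences must have the same length")
    if not probe:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(probe.upper(), target.upper()))
    return 100.0 * matches / len(probe)


def passes_stringency(probe, target, threshold=95.0):
    """True if the identity is at or above the assumed stringency cutoff."""
    return percent_identity(probe, target) >= threshold


probe = "ACGTACGTACGTACGTACGT"            # hypothetical 20-mer
one_mismatch = "ACGTACGTACGTACGTACGA"     # differs at the last position
two_mismatches = "TCGTACGTACGTACGTACGA"   # differs at first and last positions
```

Under this simplification a 20-mer tolerates a single mismatch (19 of 20 positions, 95%) but not two (18 of 20 positions, 90%).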

II. Compositions of the Invention

The present invention is based upon the use of ESTs from any desired 
cell or tissue in known technologies for oligonucleotide/polynucleotide hybridization. 



A. ESTs

An EST, as defined above, is, for an animal, a sequence from a
cDNA clone that corresponds to an mRNA. The EST sequences useful in the present
invention are isolated preferably from cDNA libraries using a rapid screening and
sequencing technique. Custom made cDNA libraries are made using known
techniques. See, generally, Sambrook et al., cited above. Briefly, mRNA from a
selected cell or tissue is reverse transcribed into complementary DNA (cDNA) using
the reverse transcriptase enzyme and made double-stranded using RNase H coupled
with DNA polymerase or reverse transcriptase. Restriction enzyme sites are added to
the cDNA and it is cloned into a vector. The result is a cDNA library. Alternatively,
commercially available cDNA libraries may be used. Libraries of cDNA can also be
generated from recombinant expression of genomic DNA using known techniques,
including polymerase chain reaction-derived techniques.

ESTs (which can range from about 150 to about 500 nucleotides in
length, preferably about 300 nucleotides) can be obtained through sequence analysis
from either end of the cDNA insert. Desirably, the DNA libraries used to obtain
ESTs use directional cloning methods so that either the 5' end of the cDNA (likely to
contain coding sequence) or the 3' end (likely to be a non-coding sequence) can be
selectively obtained.

In general, the method for obtaining ESTs comprises applying
conventional automated DNA sequencing technology to screen clones,
advantageously randomly selected clones, from a cDNA library. The cDNA libraries
from the desired tissue can be preprocessed, or edited, by conventional techniques to
reduce repeated sequencing of high and intermediate abundance clones and to
maximize the chances of finding rare messages from specific cell populations.
Preferably, preprocessing includes the use of defined composition prescreening
probes, e.g., cDNA corresponding to mitochondria, abundant sequences, ribosomes,
actins, myelin basic polypeptides, or any other known high abundance peptide. These
prescreening probes used for preprocessing are generally derived from known ESTs.
Other useful preprocessing techniques include subtraction hybridization, which
preferentially reduces the population of highly represented sequences in the library
[e.g., see Fargnoli et al., Anal. Biochem., 187:364 (1990)] and normalization, which
results in all sequences being represented in approximately equal proportions in the
library [Patanjali et al., Proc. Natl. Acad. Sci. USA, 88:1943 (1991)]. Additional
prescreening/differential screening approaches are known to those skilled in the art.

ESTs can then be generated from partial DNA sequencing of the
selected clones. The ESTs useful in the present invention are preferably generated
using low redundancy of sequencing, typically a single sequencing reaction. While
single sequencing reactions may have an accuracy as low as 90%, this nevertheless
provides sufficient fidelity for identification of the sequence and design of PCR
primers.

If desired, the location of an EST in a full length cDNA is determined
by analyzing the EST for the presence of coding sequence. A conventional computer
program is used to predict the extent and orientation of the coding region of a
sequence (using all six reading frames). Based on this information, it is possible to
infer the presence of start or stop codons within a sequence and whether the sequence
is completely coding or completely non-coding or a combination of the two. If start
or stop codons are present, then the EST can cover both part of the 5'-untranslated or
3'-untranslated part of the mRNA (respectively) as well as part of the coding
sequence. If no coding sequence is present, it is likely that the EST is derived from
the 3' untranslated sequence due to its longer length and the fact that most cDNA
library construction methods are biased toward the 3' end of the mRNA. It should be
understood that both coding and non-coding regions may provide ESTs equally useful
in the described invention.
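
The six-frame analysis referred to above can be sketched, under simplifying assumptions, as follows. The specification names no particular program; this example merely splits a sequence into codons in each of the three offsets on both strands and reports, for each frame, whether a start (ATG) or stop codon occurs and how long the longest stop-free stretch is. The standard stop codons TAA, TAG and TGA are assumed, and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand of a DNA sequence, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def six_frames(seq):
    """Yield (strand, offset, codons) for all six reading frames."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept; trailing bases are dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield strand, offset, codons


def longest_open_stretch(codons):
    """Length, in codons, of the longest run containing no stop codon."""
    best = run = 0
    for codon in codons:
        if codon in STOP_CODONS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best


def scan_est(seq):
    """Summarize start/stop codon content for each of the six frames."""
    return [
        {
            "strand": strand,
            "offset": offset,
            "has_start": START_CODON in codons,
            "has_stop": any(c in STOP_CODONS for c in codons),
            "longest_open": longest_open_stretch(codons),
        }
        for strand, offset, codons in six_frames(seq)
    ]


# Toy sequence: ATG GCC AAA TAA followed by two extra bases.
report = scan_est("ATGGCCAAATAAGG")
```

A frame with a long stop-free stretch is a candidate coding frame, while stops in every frame would suggest a non-coding (for example 3' untranslated) EST. Real gene-prediction programs use far more evidence than this.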

A number of specific ESTs suitable for use in the present
invention are described in Adams et al. (supra), which may be incorporated by
reference herein, to describe non-essential examples of desirable ESTs. Other ESTs
exist in the art which may also be useful in this invention, as will ESTs yet to be
developed by these known techniques.

B. Preparing the Solid Support of the Invention 

Oligonucleotide sequences which are fragments of defined sequence are derived
from each EST by conventional means, e.g., conventional chemical synthesis or
recombinant techniques. Each defined oligonucleotide/polynucleotide sequence as
described above is a fragment (which can be, but is not necessarily, an
anti-sense fragment) of an EST isolated from a DNA library prepared from a
selected cell or tissue type from a selected animal. For use in the present
invention, it is presently preferred that the defined
oligonucleotide/polynucleotide sequences are 20-25mers. As described above, for
each EST a number of such 20-25mers may be generated. The lengths may vary as
described above, as may the composition. For example,
oligonucleotide/polynucleotides can be modified based on the Oligo 4.0 or
similar programs to predict hybridization potential or to include modified
nucleotides for the reasons given above. It is also appreciated that large DNA
segments may be employed, including entire ESTs or even full length genes,
particularly when inserted into cloning vectors.
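
The derivation of a number of 20-25mers from a single EST may be illustrated by
the following minimal sketch. The function names, the default length of 20, and
the step of 10 bases between successive fragments are assumptions chosen for the
example; the text does not prescribe how the fragments are spaced along the EST.

```python
# Illustrative sketch only; spacing and names are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the anti-sense (reverse complement) of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tile_est(est, length=20, step=10, antisense=False):
    """Return (start, oligo) pairs of `length` bases taken every `step` bases."""
    est = est.upper()
    oligos = []
    for start in range(0, len(est) - length + 1, step):
        oligo = est[start:start + length]
        if antisense:
            # The fragment can be, but need not be, anti-sense.
            oligo = reverse_complement(oligo)
        oligos.append((start, oligo))
    return oligos


if __name__ == "__main__":
    toy_est = "ACGT" * 12 + "AC"  # 50-base toy sequence, not a real EST
    for start, oligo in tile_est(toy_est, length=25, step=5):
        print(start, oligo)
```

Recording the start position with each fragment keeps every immobilized
oligonucleotide traceable to its place in the source EST.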




A plurality of these defined oligonucleotide/polynucleotide sequences are then
attached, again by known means, to a selected solid support conventionally used
for the attachment of nucleotide sequences. In contrast to other technologies
available in the art, this support is designed to contain defined, not random,
oligonucleotide/polynucleotide sequences. The EST fragments, or defined
oligonucleotide/polynucleotide sequences, immobilized on the solid support can
include fragments of one or more ESTs from a library of at least one selected
tissue or cell sample of a healthy animal, at least one analogous sample of the
animal having a disease, at least one analogous sample of the animal infected
with a pathogen, and any combination thereof.

Numerous conventional methods are employed for attaching biological molecules
such as oligonucleotide/polynucleotide sequences to surfaces of a variety of
solid supports. See, e.g., Affinity Techniques. Enzyme Purification: Part B,
Methods in Enzymology, Vol. 34, ed. W.B. Jakoby, M. Wilchek, Acad. Press, NY
(1974); Immobilized Biochemicals and Affinity Chromatography, Advances in
Experimental Medicine and Biology, Vol. 42, ed. R. Dunlap, Plenum Press, NY
(1974); U. S. Patent No. 4,762,881; U. S. Patent No. 4,542,102; European Patent
Publication No. 391,608 (October 10, 1990); U. S. Patent No. 4,992,127 (Nov. 21,
1989).

One desirable method for attaching oligonucleotide/polynucleotide sequences
derived from ESTs to a solid support is described in International Application
No. PCT/US90/06607 (published May 30, 1991). Briefly, this method involves
forming predefined regions on a surface of a solid support, where the predefined
regions are capable of immobilizing ESTs. The methods make use of binding
substances attached to the surface which enable selective activation of the
predefined regions. Upon activation, these binding substances become capable of
binding and immobilizing oligonucleotide/polynucleotides based on EST or longer
gene sequences.

Any of the known solid substrates suitable for binding
oligonucleotide/polynucleotides at pre-defined regions on the surface thereof
for hybridization and methods for attaching the oligonucleotide/polynucleotides
thereto may be employed by one of skill in the art according to this invention.
Similarly, known conventional methods for making hybridization of the
immobilized oligonucleotide/polynucleotides detectable, e.g., fluorescence,
radioactivity, photoactivation, biotinylation, solid state circuitry, and the
like may be used in this invention.

Thus, by resorting to known techniques, the invention provides a composition
suitable for use in hybridization which consists of a surface of a solid support
on which is immobilized at pre-defined regions on said surface a plurality of
defined oligonucleotide/polynucleotide sequences for hybridization. For example,
one composition of this invention is a solid support on which are immobilized
oligos of EST fragments from a library constructed from a single cell type,
e.g., a human stem cell, or a single tissue, e.g., human liver, from a healthy
human. Still another composition of this invention is another solid support on
which are immobilized oligos of EST fragments from a library constructed from a
single cell type or a tissue from a human having a selected disease or
predisposition to a selected disease, e.g., liver cancer.

Another embodiment of the compositions of this invention includes a single solid
support having oligonucleotides of ESTs from both single cell or single tissue
libraries from both a healthy and diseased human. Still other embodiments
include a single support on which are immobilized oligos of EST fragments from
more than one tissue or cell library from a healthy human, or a single support
on which are immobilized oligos of EST fragments from more than one tissue or
cell library from both healthy and diseased animals or humans. A preferred
composition of this invention is anticipated to be a single support containing
oligos of ESTs for all known cells and tissues from a selected organism.

III. The Methods of the Invention

A. Identification of Genes 

The present invention employs the compositions described above in methods for
identifying genes which are differentially expressed in a normal healthy
organism and an organism having a disease or infection. These methods may be
employed to detect such genes, regardless of the state of knowledge about the
function of the gene. The method of this invention, by use of the compositions
containing multiple defined EST fragments from a single gene as described above,
is able to detect levels of expression of genes, or in other cases simply the
expression or lack thereof, which differ between normal, healthy organisms and
organisms having a selected disease, disorder or infection.

One such method employs a first surface of a solid support on which is
immobilized at pre-defined regions thereon a plurality of defined
oligonucleotide/polynucleotide sequences, described above, of an EST or longer
gene fragment isolated from a cDNA library prepared from at least one selected
tissue or cell sample of a healthy animal (the "healthy test surface") and a
second such surface on which is immobilized at pre-defined regions a plurality
of defined oligonucleotide/polynucleotide sequences of an EST or longer gene
fragment isolated from at least one analogous tissue of an animal having a
selected disease (the "disease test surface"). These test surfaces may be
standardized for the selected animal or selected cell or tissue sample from that
animal (i.e., they are prescreened for polymorphisms in the species population).

Polynucleotide sequences are then isolated from mRNA and/or cDNA from a
biological sample from a known healthy animal ("healthy control") and a second
sample is similarly prepared from a sample from a known diseased animal
("disease sample"). These two samples are desirably selected from the cell or
tissue analogous to that which provided the immobilized
oligonucleotide/polynucleotides.

According to the method, the healthy control sample is contacted with one set of
the healthy test surface and the disease test surface described above for a time
sufficient to permit detectable hybridization to occur between the sample and
the immobilized defined oligonucleotide/polynucleotides on each surface. The
results of this hybridization are a first hybridization pattern formed between
the nucleotides of the healthy control sample and the healthy test surface and a
second hybridization pattern formed between the nucleotides of the healthy
control sample and the disease test surface.

In a similar manner, the disease sample is detectably hybridized to another set
of healthy test and disease test surfaces, forming a third hybridization pattern
between the disease sample and healthy test surface and a fourth hybridization
pattern between the disease sample and the disease test surface.

Comparing the four hybridization patterns permits detection of those defined
oligonucleotide/polynucleotides which are differentially expressed between the
healthy control and the disease sample by the presence of differences in the
hybridization patterns at pre-defined regions. The
oligonucleotide/polynucleotides on each surface which correspond to the pattern
differences may be readily identified with the corresponding EST or longer gene
fragment from which the oligonucleotide/polynucleotides are obtained.
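
The comparison of the four hybridization patterns may be sketched as follows,
purely by way of example. Each pattern is represented as a mapping from a
pre-defined region to a measured signal. The signal values, the region and EST
labels, the two-fold threshold, and the floor applied to weak signals are all
assumptions made for the illustration and are not taken from the text.

```python
# Illustrative sketch only; all values and thresholds are made up.
def differential_regions(control, disease, min_fold=2.0, floor=1.0):
    """Return {region: fold_change} where signals differ by >= min_fold.

    Signals below `floor` are raised to `floor` so that an absent signal
    does not cause division by zero. A fold above 1 means the signal is
    higher with the disease sample.
    """
    hits = {}
    for region in sorted(set(control) & set(disease)):
        c = max(control[region], floor)
        d = max(disease[region], floor)
        fold = d / c
        if fold >= min_fold or fold <= 1.0 / min_fold:
            hits[region] = fold
    return hits


# Patterns 1 and 3 are read from the healthy test surface,
# patterns 2 and 4 from the disease test surface.
pattern_1 = {"H1": 100.0, "H2": 80.0, "H3": 50.0}  # healthy control
pattern_3 = {"H1": 95.0, "H2": 10.0, "H3": 55.0}   # disease sample
pattern_2 = {"D1": 5.0, "D2": 60.0}                # healthy control
pattern_4 = {"D1": 90.0, "D2": 62.0}               # disease sample

# Pre-defined region -> EST from which the immobilized oligo was derived.
region_to_est = {"H1": "EST-001", "H2": "EST-002", "H3": "EST-003",
                 "D1": "EST-101", "D2": "EST-102"}

hits = {**differential_regions(pattern_1, pattern_3),
        **differential_regions(pattern_2, pattern_4)}
candidate_ests = sorted(region_to_est[region] for region in hits)
```

Because each region holds a defined sequence, the lookup from a differing
region back to its EST is a simple table lookup, which is the practical
advantage of defined over random immobilized sequences.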

In another embodiment of the method of this invention, the same process is
employed, with the exception that the plurality of defined
oligonucleotide/polynucleotide sequences forming the healthy test sample and the
disease test sample surfaces are immobilized on a single solid support. For
example, each fragment of an EST or longer gene fragment on the surface is
isolated from at least two cDNA libraries prepared from a selected cell or
tissue sample of a healthy animal and an analogous selected cell or tissue
sample of an animal having a disease.

According to this embodiment, the healthy control sample is detectably
hybridized to a copy of this single solid surface, forming one hybridization
pattern with oligonucleotide/polynucleotides associated with both the healthy
and diseased animal. Similarly, the disease sample is detectably hybridized to a
second copy of this single solid surface, forming one hybridization pattern with
oligonucleotide/polynucleotides associated with both the healthy and diseased
animal.

Comparing the two hybridization patterns permits detection of those defined
oligonucleotide/polynucleotides which are differentially expressed between the
healthy control and the disease sample by the presence of differences in the
hybridization patterns at pre-defined regions. The
oligonucleotide/polynucleotides on each surface which correspond to the pattern
differences may be readily identified with the corresponding EST or longer gene
fragment from which the oligonucleotide/polynucleotides are obtained.

The identification of one or more ESTs as the source of the defined
oligonucleotide/polynucleotide which produced a "difference" in hybridization
patterns according to these methods permits ready identification of the gene
from which those ESTs were derived. Because the oligonucleotides are of
sufficient length that they will hybridize under stringent conditions only with
an RNA/cDNA for that gene to which they correspond, the oligo can be used to
identify the EST and in turn the clone from which it was derived and, by
subsequent cloning, to obtain the sequence of the full-length cDNA and its
genomic counterparts, i.e., the gene, from which it was obtained.
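
The requirement that an oligonucleotide correspond to only one gene can be
approximated in software before synthesis. The following minimal sketch keeps
only those oligos whose exact sequence occurs in their source EST and in no
other EST of the collection. Exact sense-strand matching is used here as a crude
stand-in for hybridization under stringent conditions, and the sequences and
names are invented for the example.

```python
# Illustrative sketch only; sequences and names are invented.
def containing_ests(oligo, est_sequences):
    """Names of all ESTs whose sequence contains the oligo (sense strand)."""
    return sorted(name for name, seq in est_sequences.items() if oligo in seq)


def gene_specific_oligos(oligo_sources, est_sequences):
    """Keep oligos found in their source EST and in no other EST."""
    return {
        oligo: source
        for oligo, source in oligo_sources.items()
        if containing_ests(oligo, est_sequences) == [source]
    }


UNIQUE_A = "ACGTTGCAAGGCTTAACGCA"
SHARED = "CCGGTTAGCATCGATTTGGC"
UNIQUE_B = "TTGACCATGGAACTCTGAAG"

est_sequences = {"EST-A": UNIQUE_A + SHARED, "EST-B": SHARED + UNIQUE_B}
oligo_sources = {UNIQUE_A: "EST-A", SHARED: "EST-A", UNIQUE_B: "EST-B"}

specific = gene_specific_oligos(oligo_sources, est_sequences)
```

In this toy collection the 20mer shared by both ESTs is discarded, so each
remaining oligo points back to a single EST and hence to a single clone.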

In other words, the ESTs identified by the method of this invention can be
employed to determine the complete sequence of the mRNA, in the form of
transcribed cDNA, by using the EST as a probe to identify a cDNA clone
corresponding to a full-length transcript, followed by sequencing of that clone.
The EST or the full length cDNA clone can also be used as a probe to identify a
genomic clone or clones that contain the complete gene including regulatory and
promoter regions, exons, and introns.

It should be appreciated that one does not have to be restricted to using ESTs
from the particular tissue from which probe RNA or cDNA is obtained; rather, any
or all ESTs (known or unknown) may be placed on the support. Hybridization will
be used to form diagnostic patterns or to identify which particular EST is
detected. For example, all known ESTs from an organism are used to produce a
"master" solid support to which control sample and disease samples are
alternately hybridized. One then detects a pattern of hybridization associated
with the particular disease state, which then forms the basis of a diagnostic
test or the isolation of disease specific ESTs from which the intact gene may be
cloned and sequenced, leading ultimately to a defined therapeutic target.

Methods for obtaining complete gene sequences from ESTs are well-known to those
of skill in the art. See, generally, Sambrook et al, cited above. Briefly, one
suitable method involves purifying the DNA from the clone that was sequenced to
give the EST and labeling the isolated insert DNA. Suitable labeling systems are
well known to those of skill in the art [see, e.g., Basic Methods in Molecular
Biology, L. G. Davis et al, ed., Elsevier Press, NY (1986)]. The labeled EST
insert is then used as a probe to screen a lambda phage cDNA library or a
plasmid cDNA library, identifying colonies containing clones related to the
probe cDNA which can be purified by known methods. The ends of the newly
purified clones are then sequenced to identify full length sequences, and
complete sequencing of full length clones is performed by enzymatic digestion or
primer walking. A similar screening and clone selection approach can be applied
to clones from a genomic DNA library.

Additionally, an EST or gene identified by this method as associated with
inherited disorders can be used to determine at what stage during embryonic
development the selected gene from which it is derived is developed by screening
embryonic DNA libraries from various stages of development, e.g., 2-cell,
8-cell, etc., for the selected gene. As has been mentioned above, the invention
may be applied in additional temporal modes for monitoring the progression of a
disease state, the efficacy of a particular treatment modality or the aging
process of an individual.

Thus, the methods of this invention permit the identification, isolation and
sequencing of a gene which is differentially expressed in a selected
disease/infection. As described in more detail below, the identified gene may
then be employed to obtain any protein encoded thereby, or may be employed as a
target for diagnostic methods or therapeutic approaches to the treatment of the
disease, including, e.g., drug development.

The same methods as described above for the identification of genes, including
genes of unknown function, which are differentially expressed in a disease
state, may also be employed to identify other genes of interest. For example,
another embodiment of this invention includes a method for identifying a gene of
a pathogen which is expressed in a biological sample of an animal infected with
that pathogen or the gene of the host which is altered in its expression as a
result of the infection.

One such method employs a healthy test surface as described above, employing
defined oligonucleotide/polynucleotides from a sample of a healthy, uninfected
animal. The second such surface has immobilized at pre-defined regions thereon a
plurality of defined oligonucleotide/polynucleotide sequences of ESTs isolated
from at least one analogous tissue or cell sample of an infected animal (the
"infection test surface"). Polynucleotide sequences are isolated from a
biological sample from a healthy animal ("healthy control") and a second sample
is similarly prepared from an animal infected with the selected pathogen
("infection sample"). These two samples are desirably selected from the cell or
tissue analogous to that which provided the immobilized
oligonucleotide/polynucleotides. It would also be possible to provide samples
from the nucleic acid of the pathogen itself.

According to the method, the healthy control sample is contacted with one set of
the healthy test surface and the infection test surface described above for a
time sufficient to permit detectable hybridization to occur between the sample
and the immobilized defined oligonucleotide/polynucleotides on each surface. The
results of this hybridization are a first hybridization pattern formed between
the nucleotides of the healthy control sample and the healthy test surface and a
second hybridization pattern formed between the nucleotides of the healthy
control sample and the infection test surface.

In a similar manner, the infection sample is detectably hybridized to another
set of healthy test and infection test surfaces, forming a third hybridization
pattern between the infection sample and healthy test surface and a fourth
hybridization pattern between the infection sample and the infection test
surface.

Comparing the four hybridization patterns permits detection of those defined
oligonucleotide/polynucleotides which are differentially expressed between the
healthy animal and the animal infected with the pathogen by the presence of
differences in the hybridization patterns at pre-defined regions. As mentioned,
differential expression is not required, and simple qualitative analysis is
possible by reference to gene expression which is simply present or absent.
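
The simple qualitative analysis mentioned above may be sketched as follows. Each
region is called present or absent against a detection threshold, and only
regions whose call changes between the two samples are reported. The threshold
of 20.0, the signal values, and the region names are assumptions made for the
example.

```python
# Illustrative sketch only; threshold and signals are made up.
def presence_calls(pattern, threshold):
    """Call each region present (True) or absent (False)."""
    return {region: signal >= threshold for region, signal in pattern.items()}


def qualitative_differences(control, infected, threshold=20.0):
    """Report regions that are present in one sample and absent in the other."""
    c = presence_calls(control, threshold)
    i = presence_calls(infected, threshold)
    return {
        region: ("gained" if i[region] else "lost")
        for region in sorted(c.keys() & i.keys())
        if c[region] != i[region]
    }


healthy_control = {"R1": 50.0, "R2": 5.0, "R3": 40.0}
infection_sample = {"R1": 45.0, "R2": 70.0, "R3": 3.0}

changes = qualitative_differences(healthy_control, infection_sample)
```

A region reported as "gained" on an infection test surface is a candidate for a
pathogen gene, or for a host gene switched on by the infection.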

A second embodiment of this method parallels the second embodiment of the method
as applied to disease above, i.e., the same process is employed, with the
exception that the plurality of defined oligonucleotide/polynucleotide sequences
forming the healthy test sample surface and the infection test sample surface
are immobilized on a single solid support. The resulting first hybridization
pattern (healthy control sample with healthy/infection test sample) and second
hybridization pattern (infection sample with healthy/infection test sample)
permit detection of those defined oligonucleotide/polynucleotides which are
differentially expressed between the healthy control and the infection sample by
the presence of differences in the hybridization patterns at pre-defined
regions. The oligonucleotide/polynucleotides on each surface which correspond to
the pattern differences may be readily identified with the corresponding ESTs
from which the oligonucleotide/polynucleotides are obtained.

As described above for the methods for identifying differential gene expression
between diseased and healthy animals, the oligonucleotide/polynucleotides on
each surface which correspond to the pattern differences may be readily
identified with the corresponding ESTs from which the
oligonucleotide/polynucleotide sequences are obtained and the genes expressed by
the pathogen identified for similar purposes. Other embodiments of these methods
may be developed with resort to the teaching herein, by altering the samples
which provide the defined oligonucleotide/polynucleotides. For example, an EST
identified with a differentially expressed gene by the method of this invention
is also useful in detecting genes expressed in the various stages of a
pathogen's development, particularly the infective stage, and in following the
course of drug treatment and emergence of resistant variants. For example,
employing the techniques described above, the EST can be used for detecting a
gene in various stages of the parasitic Plasmodium species life cycle, which
include blood stages, liver stages, and gametocyte stages.

B. Diagnostic Methods 

In addition to use of the methods and compositions of this invention for
identifying differentially expressed genes, another embodiment of this invention
provides diagnostic methods for diagnosing a selected disease state, or a
selected state resulting from aging, exposure to drugs or infection in an
animal. According to this aspect of the invention, a first surface, described as
the healthy test surface above, and a second surface, described as the disease
test surface or infection test surface, are prepared depending on the disease or
infection to be diagnosed. The same processes of detectable hybridization to a
first and second set of these surfaces with the healthy control sample and
disease/infection sample are followed to provide the four above-described
hybridization patterns, i.e., healthy control sample with healthy test surface;
healthy control sample with disease/infection test surface; disease/infection
sample with healthy test surface; and disease/infection sample with
disease/infection test surface.

The diagnosis of disease or infection is provided by comparing the four
hybridization patterns. Substantial differences between the first and third
hybridization patterns, and between the second and fourth hybridization
patterns, indicate the presence of the selected disease or infection in said
animal. Substantial similarities in the first and third hybridization patterns
and in the second and fourth hybridization patterns indicate the absence of
disease or infection.
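
One way to make "substantial differences" concrete is sketched below, by way of
example only. The fraction of regions whose signal changes at least two-fold is
computed for each surface, and a cutoff on that fraction decides the outcome.
The two-fold threshold, the 0.25 cutoff, the signal values, and the
"indeterminate" outcome for the mixed case (which the text does not address) are
all assumptions made for the illustration.

```python
# Illustrative sketch only; thresholds, cutoff and signals are made up.
def fraction_different(a, b, min_fold=2.0, floor=1.0):
    """Fraction of shared regions whose signals differ by >= min_fold."""
    regions = sorted(set(a) & set(b))
    if not regions:
        raise ValueError("patterns share no regions")
    changed = 0
    for region in regions:
        x = max(a[region], floor)
        y = max(b[region], floor)
        if max(x, y) / min(x, y) >= min_fold:
            changed += 1
    return changed / len(regions)


def diagnose(p1, p2, p3, p4, cutoff=0.25):
    """Compare pattern 1 with 3 (healthy surface) and 2 with 4 (disease surface)."""
    healthy_surface_differs = fraction_different(p1, p3) >= cutoff
    disease_surface_differs = fraction_different(p2, p4) >= cutoff
    if healthy_surface_differs and disease_surface_differs:
        return "disease or infection indicated"
    if not healthy_surface_differs and not disease_surface_differs:
        return "no disease or infection indicated"
    return "indeterminate"


p1 = {"H1": 100.0, "H2": 80.0, "H3": 50.0, "H4": 40.0}
p2 = {"D1": 5.0, "D2": 60.0}
p3 = {"H1": 95.0, "H2": 10.0, "H3": 55.0, "H4": 38.0}
p4 = {"D1": 90.0, "D2": 62.0}

result = diagnose(p1, p2, p3, p4)
```

In practice the cutoff would have to be set from hybridizations of samples of
known status rather than chosen in advance.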

A similar embodiment utilizes the single surface bearing both the healthy test
surface defined oligonucleotide/polynucleotides and the disease/infection test
surface defined oligonucleotide/polynucleotides as described above. Parallel
process steps as described above for detection of genes differentially expressed
in disease and infected states are followed, resulting in a first hybridization
pattern (healthy control sample with single healthy and disease/infection test
sample) and a second hybridization pattern (disease/infection sample with
another copy of the single healthy and disease/infection test sample).

Diagnosis is accomplished by comparing the two hybridization patterns, wherein
substantial differences between the first and second hybridization patterns
indicate the presence of the selected disease or infection in the animal being
tested. Substantially similar first and second hybridization patterns indicate
the absence of disease or infection. This, like many of the foregoing
embodiments, may use known or unknown ESTs derived from many libraries.

C. Other Methods of the Invention

As is obvious to one of skill in the art upon reading this disclosure, the
compositions and methods of this invention may also be used for other similar
purposes. For example, the general methods and compositions may be adapted
easily by manipulation of the samples selected to provide the standardized
defined oligonucleotide/polynucleotides, and selection of the samples selected
for hybridization thereto. One such modification is the use of this invention to
identify cell markers of any type, e.g., markers of cancer cells, stem cell
markers, and the like. Another modification involves the use of the method and
compositions to generate hybridization patterns useful for forensic
identification or an 'expression fingerprint' of genes for identification of one
member of a species from another. Similarly, the methods of this invention may
be adapted for use in tissue matching for transplantation purposes as well as
for molecular histology, i.e., to enable diagnosis of disease or disorders in
pathology tissue samples such as biopsies. Still another use of this method is
in monitoring the effects of development and aging upon the gene expression in a
selected animal, by preparing surfaces bearing oligonucleotide/polynucleotides
prepared from samples of standardized younger members of the species being
tested. Additionally, the patient can serve as an internal control by virtue of
having the method applied to blood samples every 5-10 years during his lifetime.
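
The use of a patient as his own internal control may be illustrated by the
following minimal sketch, in which each later pattern is expressed as a log2
ratio against the earliest sample from the same individual. The sampling years,
signal values, region names, and the floor applied to weak signals are
assumptions made for the example, and every sample is assumed to report the same
regions.

```python
# Illustrative sketch only; time points and signals are made up.
import math


def drift_from_baseline(series, floor=1.0):
    """Return (later_times, {region: [log2 ratio vs earliest sample, ...]}).

    `series` maps a sampling time (e.g. years since first sample) to a
    {region: signal} pattern. Assumes all samples share the same regions.
    """
    times = sorted(series)
    baseline = series[times[0]]
    drift = {}
    for region in sorted(baseline):
        base = max(baseline[region], floor)
        drift[region] = [
            math.log2(max(series[t][region], floor) / base)
            for t in times[1:]
        ]
    return times[1:], drift


samples = {
    0: {"R1": 40.0, "R2": 64.0},    # first blood sample
    5: {"R1": 80.0, "R2": 64.0},    # five years later
    10: {"R1": 160.0, "R2": 16.0},  # ten years later
}

later_times, drift = drift_from_baseline(samples)
```

A log2 ratio of 1 corresponds to a doubling and -1 to a halving relative to the
first sample, so steady drift in one direction is easy to see across time
points.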

Still another intriguing use of this method is in the area of monitoring the
effects of drugs on gene expression, both in laboratories and during clinical
trials with animals, especially humans. Because the method can be readily
adapted by altering the above parameters, it can essentially be employed to
identify differentially expressed genes of any organism, at any stage of
development, and under the influence of any factor which can affect gene
expression.






IV. The Genes and Proteins Identified

Application of the compositions and methods of this invention as above described
also provides other compositions, such as any isolated gene sequence which is
differentially expressed between a normal healthy animal and an animal having a
disease or infection. Another embodiment of this invention is any isolated
pathogen gene sequence which is expressed in tissue or cell samples of an
infected animal. Similarly, an embodiment of this invention is any gene sequence
identified by the methods described herein.

These gene sequences may be employed in conventional methods to produce isolated
proteins encoded thereby. To produce a protein of this invention, the DNA
sequences of a desired gene identified by the use of the methods of this
invention or portions thereof are inserted into a suitable expression system.
Desirably, a recombinant molecule or vector is constructed in which the
polynucleotide sequence encoding the protein is operably linked to a
heterologous expression control sequence permitting expression of the human
protein. Numerous types of appropriate expression vectors and host cell systems
are known in the art for mammalian (including human) expression, insect, e.g.,
baculovirus expression, yeast, fungal, and bacterial expression, by standard
molecular biology techniques.

The transfection of these vectors into appropriate host cells, whether
mammalian, bacterial, fungal, or insect, or into appropriate viruses, can result
in expression of the selected proteins. Suitable host cells or cell lines for
transfection, and viruses, as well as methods for the construction and
transfection of such host cells and viruses are well-known. Suitable methods for
transfection, culture, amplification, screening, and product production and
purification are also known in the art.

The genes and proteins identified by this invention can be employed, if desired,
in diagnostic compositions useful for the diagnosis of a disease or infection
using conventional diagnostic assays. For example, a diagnostic reagent can be
developed which detectably targets a gene sequence or protein of this invention
in a biological sample of an animal. Such a reagent may be a complementary
nucleotide sequence, an antibody (monoclonal, recombinant or polyclonal), or a
chemically derived agonist or antagonist. Alternatively, the proteins and
polynucleotide sequences of this invention, fragments of same, or complementary
sequences thereto, may themselves be useful as diagnostic reagents for
diagnosing disease states with which the ESTs of the invention are associated.
These reagents may optionally be labelled using diagnostic labels, such as
radioactive labels, colorimetric enzyme label systems and the like
conventionally used in diagnostic or therapeutic methods, e.g., Northern and
Western blotting, antigen-antibody binding and the like. The selection of the
appropriate assay format and label system is within the skill of the art and may
readily be chosen without requiring additional explanation by resort to the
wealth of art in the diagnostic area.

Additionally, genes and proteins identified according to this invention may be
used therapeutically. For example, the EST-containing gene sequences may be
useful in gene therapy, to provide a gene sequence which in a disease is not
properly or sufficiently expressed. In such a method, a selected gene sequence
of this invention is introduced into a suitable vector or other delivery system
for delivery to a cell containing a defect in the selected gene. Suitable
delivery systems are well known to those of skill in the art and enable the
desired EST or gene to be incorporated into the target cell and to be translated
by the cell. The EST or gene sequence may be introduced to mutate the existing
gene by recombination or provide an active copy thereof in addition to the
inactive gene to replace its function.

Alternatively, a protein encoded by an EST or gene of the invention may be
useful as a therapeutic reagent for delivery of a biologically active protein,
particularly when the disease state is associated with a deficiency of this
protein. Such a protein may be incorporated into an appropriate therapeutic
formulation, alone or in combination with other active ingredients. Methods of
formulating such therapeutic compositions, as well as suitable pharmaceutical
carriers, and the like, are well known to those of skill in the art. Still an
additional method of delivering the missing protein encoded by an EST, or the
gene from which a selected EST was derived, involves expressing it directly in
vivo. Systems for such in vivo expression are well known in the art.

Yet another use of the ESTs, genes identified according to the methods of this
invention, or the proteins encoded thereby is as a target for the screening and
development of natural or synthetic chemical compounds which have utility as
therapeutic drugs for the treatment of disease states associated with the
identified genes and ESTs derived therefrom. As one example, a compound capable
of binding to such a protein encoded by such a gene and either preventing or
enhancing its biological activity may be a useful drug component for the
treatment or prevention of such disease states.

Conventional assays and techniques may be used for the screening and development
of such drugs. As one example, a method for identifying compounds which
specifically bind to or inhibit or activate proteins encoded by these gene
sequences can include simply the steps of contacting a selected protein or gene
product with a test compound to permit binding of the test compound to the
protein; and determining the amount of test compound, if any, which is bound to
the protein. Such a method may involve the incubation of the test compound and
the protein immobilized on a solid support. Still other conventional methods of
drug screening can involve employing a suitable computer program to determine
compounds having similar or complementary chemical structures to that of the
gene product or portions thereof and screening those compounds either for
competitive binding to the protein or to detect enhanced or decreased activity
in the presence of the selected compound.

Thus, through use of such methods, the present invention is anticipated to
provide compounds capable of interacting with these genes, ESTs, or encoded
proteins, or fragments thereof, and either enhancing or decreasing the
biological activity, as desired. Such compounds are believed to be encompassed
by this invention.

Numerous modifications and variations of the present invention are included in
the above-identified specification and are expected to be obvious to one of
skill in the art. Such modifications and alterations to the compositions and
processes of the present invention are believed to be encompassed in the scope
of the claims appended hereto.


WHAT IS CLAIMED IS: 

1. A method for identifying genes which are differentially expressed in
two different pre-determined states of an organism comprising:

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample in a first
state and present in excess relative to the polynucleotide to be hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample in a second
state and present in excess relative to the polynucleotide to be hybridized;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from said organism in said first
state, said sample selected from sources analogous to the sources of step (a), said
hybridization sufficient to form a first and second hybridization pattern on each said
first and second surface,

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from said organism in said second
state, said sample selected from sources analogous to the sources of step (c), said
hybridization sufficient to form a third and fourth hybridization pattern on each said
first and second surface,

e. comparing at least two of the four hybridization patterns,
wherein genes differentially expressed in said first and second states are identified by
the presence of differences in the hybridization patterns at pre-defined regions;

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs or larger
gene fragment from which the oligonucleotide/polynucleotides were obtained,
whereby identification of the EST or larger gene fragment permits identification of
the gene from which the ESTs or larger gene fragment were derived.




2. The method according to Claim 1 wherein said first and second states are
respectively healthy and disease; pathogen uninfected and pathogen infected; a first
progression state and a second progression of a disease or infection; a first treatment
state and a second treatment state of a disease or infection; or a first developmental
and a second developmental state.

3. The method according to Claim 1 wherein said organism is a plant or an
animal.

4. The method according to Claim 3 wherein said animal is a human.

5. A method for identifying genes which are differentially expressed in a
normal healthy animal and an animal having a disease comprising:

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a
fragment of an EST, an entire EST, a fragment of a gene or an entire gene, isolated
from a DNA library prepared from at least one selected cell, tissue, organ or organism
sample in a healthy animal and present in excess relative to the polynucleotide to be
hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions of said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a
fragment of an EST, an entire EST, a fragment of a gene or an entire gene, isolated
from a DNA library prepared from at least one selected cell, tissue, organ or organism
sample from an animal having said disease and present in excess relative to the
polynucleotide to be hybridized;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from sources analogous to the sources of step (a), said hybridization
sufficient to form a first and second hybridization pattern on each said first and
second surface, said sample selected from a cell or tissue sample analogous to the
sample of step (a), said hybridization sufficient to form a first and second
hybridization pattern on each said first and second surface;

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from an animal having said disease,
said sample selected from a cell or tissue sample analogous to the sample of step (c),
said hybridization sufficient to form a third and fourth hybridization pattern on each
said first and second surface,

e. comparing at least two of the four hybridization patterns,
wherein genes differentially expressed in said first and second states are identified by
the presence of differences in the hybridization patterns at pre-defined regions;

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs or larger
gene fragment from which the oligonucleotide/polynucleotides were obtained,
whereby identification of the EST or larger gene fragment permits identification of
the gene from which the ESTs or larger gene fragment were derived.

6. A method for identifying genes which are differentially expressed in a
normal healthy animal and an animal having a disease comprising:

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene isolated from a DNA library
prepared from the group selected from at least one selected cell, tissue, organ or
organism sample of a healthy animal and an analogous selected sample of an
animal having said disease and both present in excess relative to the polynucleotide to
be hybridized;

b. detectably hybridizing to a first copy of said surface
polynucleotide sequences isolated from a healthy animal, said sample selected from a
cell or tissue sample analogous to the sample of step (a), said hybridization sufficient
to form a first hybridization pattern on said surface;

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from an animal having said disease, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a second hybridization pattern on said surface;

d. comparing the two hybridization patterns, wherein genes
differentially expressed in a disease state are identified by the presence of differences
in the hybridization patterns at pre-defined regions;

e. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST
permits identification of the gene from which the ESTs were derived.

7. A method for identifying a gene of a pathogen which is expressed in a 
biological sample of an animal infected with said pathogen comprising: 

a. providing a first surface on which is immobilized at pre- 
defined regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample of a 
healthy, uninfected animal and present in excess relative to the polynucleotide to be 
hybridized; 

b. providing a second surface on which is immobilized at pre- 
defined regions of said surface a plurality of defined oligonucleotide/polynucleotide 
sequences, each sequence selected from the group consisting of a fragment of an EST, 
an entire EST, a fragment of a gene or an entire gene isolated from at least one
selected cell, tissue, organ or organism sample of an infected animal; 

c. detectably hybridizing to a set of said first and second surfaces 
polynucleotide sequences isolated from a sample from a healthy animal, said sample 
selected from a cell or tissue sample analogous to the sample of step (a), said 
hybridization sufficient to form first and second hybridization patterns on each said 
first and second surface, 

d. detectably hybridizing to a set of said first and second surfaces 
polynucleotide sequences isolated from a sample from an infected animal, said 
sample selected from a cell or tissue sample analogous to the sample of step (a), said 
hybridization sufficient to form third and fourth hybridization patterns on each said 
first and second surface, 

e. comparing the four hybridization patterns, wherein genes of 
said pathogen which are expressed in an infected animal are identified by the 
presence of differences in the hybridization patterns at pre-defined regions; 

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which 
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST 
permits identification of the gene from which the ESTs were derived. 




8. A method for identifying a gene of a pathogen which is expressed in a 
biological sample of an animal infected with said pathogen comprising: 

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene isolated from a DNA library
prepared from the group selected from at least one selected cell, tissue, organ or
organism sample of a healthy animal and an analogous selected sample of an
animal having said disease and both present in excess relative to the polynucleotide to
be hybridized;

b. detectably hybridizing to a first copy of said surface
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a first hybridization pattern on said surface;

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from a sample from an infected animal, said
sample selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a second hybridization pattern on said surface;

d. comparing the two hybridization patterns, wherein genes of
said pathogen which are expressed in an infected animal are identified by the
presence of differences in the hybridization patterns at pre-defined regions;

e. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST
permits identification of the gene from which the ESTs were derived.

9. A composition suitable for use in hybridization comprising a solid
surface on which is immobilized at pre-defined regions on said surface a plurality of
defined oligonucleotide/polynucleotide sequences for hybridization, each sequence
selected from the group consisting of a fragment of an EST, an entire EST, a fragment
of a gene or an entire gene isolated from a DNA library prepared from the group
selected from at least one selected cell, tissue, organ or organism sample of a healthy
animal, at least one analogous sample of said animal having a disease, at least one
analogous sample of said animal infected with a microbial pathogen, and any
combination thereof.






10. An isolated gene sequence which is differentially expressed in a
normal healthy animal and an animal having a disease, identified by the method of
claim 1.

11. An isolated pathogen gene sequence which is expressed in tissue or
cell samples of an infected animal identified by the method of claim 7.

12. A diagnostic composition useful for the diagnosis of a disease
comprising a reagent capable of detectably targeting a gene sequence of claim 10 in a
biological sample of an animal.

13. A diagnostic composition useful for the diagnosis of infection by a
pathogen comprising a reagent capable of detectably targeting a gene sequence of
claim 11 in a biological sample of an animal.

14. An isolated protein produced by expression of a gene sequence of
claim 10.

15. An isolated pathogen protein produced by expression of a gene
sequence of claim 11.

16. A therapeutic composition comprising a protein or fragment thereof 
selected from the group consisting of a protein of claim 10 and a protein of claim 15. 

17. A method for diagnosing a selected disease or infection in an animal
comprising:

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST, a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample of a healthy
animal and present in excess relative to the polynucleotide to be hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions of said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence comprising a fragment of an EST isolated from at least one
said tissue of an animal having said disease;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a DNA library prepared from a sample from a
healthy animal, said sample selected from a cell or tissue sample analogous to the
sample of step (a), said hybridization sufficient to form a first and second
hybridization pattern on each said first and second surface;

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a DNA library prepared from a sample from
an animal having said disease, said sample selected from a cell or tissue sample
analogous to the sample of step (c), said hybridization sufficient to form a third and
fourth hybridization pattern on each said first and second surface;

e. comparing the four hybridization patterns, wherein substantial
differences between the first and third hybridization patterns and the second and
fourth hybridization patterns indicates the presence of said selected disease or
infection in said animal, and substantial similarities in said first and third
hybridization patterns and second and fourth hybridization patterns indicates the
absence of disease or infection.

18. A method for diagnosing a selected disease or infection in an animal 
comprising: 

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence comprising a fragment of an EST isolated from a DNA
library prepared from the group consisting of a selected cell or tissue sample of a
healthy animal and an analogous selected cell or tissue sample of an animal having
said disease;

b. detectably hybridizing to a first copy of said surface
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a first hybridization pattern on said surface;

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from a DNA library prepared from a sample from
an animal having said disease, said sample selected from a cell or tissue sample
analogous to the sample of step (a), said hybridization sufficient to form a second
hybridization pattern on said surface;

d. comparing the two hybridization patterns, wherein substantial
differences between the first and second hybridization patterns indicates the presence
of said selected disease or infection in said animal, and substantial similarities in said
first and second hybridization patterns indicates the absence of disease or infection.




INTERNATIONAL SEARCH REPORT

International application No.
PCT/US95/01863

A. CLASSIFICATION OF SUBJECT MATTER

IPC(6) : C12Q 1/68
US CL : 435/6

According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED

Minimum documentation searched (classification system followed by classification symbols)
U.S. : 435/6

Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched

Electronic data base consulted during the international search (name of data base and, where practicable, search terms used)
APS, CAS, BIOSIS



C. DOCUMENTS CONSIDERED TO BE RELEVANT

Citation of document, with indication, where appropriate, of the relevant passages, followed by the claims to which it is relevant:

Citation: ANALYTICAL BIOCHEMISTRY, VOLUME 187, ISSUED 1990, FARGNOLI ET AL, "LOW-RATIO HYBRIDIZATION SUBTRACTION", PAGES 364-373, SEE ENTIRE DOCUMENT.
Relevant to claim No.: 1-18

Citation: PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES USA, VOLUME 88, ISSUED MARCH 1991, PATANJALI ET AL, "CONSTRUCTION OF A UNIFORM-ABUNDANCE (NORMALIZED) CDNA LIBRARY", PAGES 1943-1947, SEE ENTIRE DOCUMENT.
Relevant to claim No.: 1-18

Citation: SCIENCE, VOLUME 245, ISSUED 29 SEPTEMBER 1989, OLSON ET AL, "A COMMON LANGUAGE FOR PHYSICAL MAPPING OF THE HUMAN GENOME", PAGES 1434-1435, SEE ENTIRE DOCUMENT.
Relevant to claim No.: 1-18

[X] Further documents are listed in the continuation of Box C.    [ ] See patent family annex.



* Special categories of cited documents:

"A" document defining the general state of the art which is not considered to be of particular relevance
"E" earlier document published on or after the international filing date
"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or other special reason (as specified)
"O" document referring to an oral disclosure, use, exhibition or other means
"P" document published prior to the international filing date but later than the priority date
"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention
"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step when the document is taken alone
"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art
"&" document member of the same patent family



Date of the actual completion of the international search
03 APRIL 1995

Date of mailing of the international search report
17 MAY 1995

Name and mailing address of the ISA/US
Commissioner of Patents and Trademarks
Box PCT
Washington, D.C. 20231
Facsimile No. (703) 305-3230

Authorized officer
EGGERTON CAMPBELL
Telephone No. (703) 308-0196

Form PCT/ISA/210 (second sheet)(July 1992)*



C (Continuation). DOCUMENTS CONSIDERED TO BE RELEVANT

Category*: Y
Citation: SCIENCE, VOLUME 252, ISSUED 21 JUNE 1991, ADAMS ET AL, "COMPLEMENTARY DNA SEQUENCING: EXPRESSED SEQUENCE TAGS AND HUMAN GENOME PROJECT", PAGES 1651-1656, SEE ENTIRE DOCUMENT.
Relevant to claim No.: 1-18

Form PCT/ISA/210 (continuation of second sheet)(July 1992)*



WORLD INTELLECTUAL PROPERTY ORGANIZATION
International Bureau

Reference 2 of 20
with Response dated 05/04/04
In USSN: 09/857,826

PCT

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(51) International Patent Classification: C12Q 1/68, G06F 15/00    A1

(11) International Publication Number: WO 95/20681

(43) International Publication Date: 3 August 1995 (03.08.95)

(21) International Application Number: PCT/US95/01160

(22) International Filing Date: 27 January 1995 (27.01.95)

(30) Priority Data:
08/187,530    27 January 1994 (27.01.94)    US
08/282,955    29 July 1994 (29.07.94)       US

(71) Applicant: INCYTE PHARMACEUTICALS, INC. [US/US];
3330 Hillview Avenue, Palo Alto, CA 94304 (US).

(72) Inventors: SEILHAMER, Jeffrey, J.; 12555 La Cresta, Los
Altos Hills, CA 94022 (US). SCOTT, Randal, W.; 13140
Sun-Mor, Mountain View, CA 94040 (US).

(74) Agents: CAGE, Kenneth, L. et al.; Willian Brinks Hofer Gilson
& Lione, 2000 K Street, N.W., Suite 200, Washington, DC
20006-1809 (US).

(81) Designated States: AM, AU, BB, BG, BR, BY, CA, CN, CZ,
EE, FI, GB, HU, JP, KG, KP, KR, KZ, LK, LR, LT, LV,
MD, MG, MN, MX, NO, NZ, PL, RO, RU, SI, SK, TJ, TT,
UA, UZ, VN, European patent (AT, BE, CH, DE, DK, ES,
FR, GB, GR, IE, IT, LU, MC, NL, PT, SE), OAPI patent
(BF, BJ, CF, CG, CI, CM, GA, GN, ML, MR, NE, SN, TD,
TG), ARIPO patent (KE, MW, SD, SZ).

Published
With international search report.

(54) Title: COMPARATIVE GENE TRANSCRIPT ANALYSIS



(57) Abstract 

A method and system for quantifying the relative abundance of gene transcripts in a biological specimen. One embodiment of the 
method generates high-throughput sequence-specific analysis of multiple RNAs or their corresponding cDNAs (gene transcript imaging 
analysis). Another embodiment of the method produces a gene transcript imaging analysis by the use of high-throughput cDNA sequence 
analysis. In addition, the gene transcript imaging can be used to detect or diagnose a particular biological state, disease, or condition 
which is correlated to the relative abundance of gene transcripts in a given cell or population of cells. The invention provides a method
for comparing the gene transcript image analysis from two or more different biological specimens in order to distinguish between the two 
specimens and identify one or more genes which are differentially expressed between the two specimens. 



FOR THE PURPOSES OF INFORMATION ONLY

Codes used to identify States party to the PCT on the front pages of pamphlets publishing international
applications under the PCT.

AT  Austria                     GB  United Kingdom                           MR  Mauritania
AU  Australia                   GE  Georgia                                  MW  Malawi
BB  Barbados                    GN  Guinea                                   NE  Niger
BE  Belgium                     GR  Greece                                   NL  Netherlands
BF  Burkina Faso                HU  Hungary                                  NO  Norway
BG  Bulgaria                    IE  Ireland                                  NZ  New Zealand
BJ  Benin                       IT  Italy                                    PL  Poland
BR  Brazil                      JP  Japan                                    PT  Portugal
BY  Belarus                     KE  Kenya                                    RO  Romania
CA  Canada                      KG  Kyrgystan                                RU  Russian Federation
CF  Central African Republic    KP  Democratic People's Republic of Korea    SD  Sudan
CG  Congo                                                                    SE  Sweden
CH  Switzerland                 KR  Republic of Korea                        SI  Slovenia
CI  Côte d'Ivoire               KZ  Kazakhstan                               SK  Slovakia
CM  Cameroon                    LI  Liechtenstein                            SN  Senegal
CN  China                       LK  Sri Lanka                                TD  Chad
CS  Czechoslovakia              LU  Luxembourg                               TG  Togo
CZ  Czech Republic              LV  Latvia                                   TJ  Tajikistan
DE  Germany                     MC  Monaco                                   TT  Trinidad and Tobago
DK  Denmark                     MD  Republic of Moldova                      UA  Ukraine
ES  Spain                       MG  Madagascar                               US  United States of America
FI  Finland                     ML  Mali                                     UZ  Uzbekistan
FR  France                      MN  Mongolia                                 VN  Viet Nam
GA  Gabon











WO 95/20681    PCT/US95/01160



COMPARATIVE GENE TRANSCRIPT ANALYSIS 
1. FIELD OF INVENTION 

The present invention is in the field of molecular
biology and computer science; more particularly, the
present invention describes methods of analyzing gene
transcripts and diagnosing the genetic expression of cells
and tissue.

2. BACKGROUND OF THE INVENTION 

Until very recently, the history of molecular biology
has been written one gene at a time. Scientists have
observed the cell's physical changes, isolated mixtures
from the cell or its milieu, purified proteins, sequenced
proteins and therefrom constructed probes to look for the
corresponding gene.

Recently, different nations have set up massive
projects to sequence the billions of bases in the human
genome. These projects typically begin with dividing the
genome into large portions of chromosomes and then
determining the sequences of these pieces, which are then
analyzed for identity with known proteins or portions
thereof, known as motifs. Unfortunately, the majority of
genomic DNA does not encode proteins and though it is
postulated to have some effect on the cell's ability to
make protein, its relevance to medical applications is not
understood at this time.

A third methodology involves sequencing only the
transcripts encoding the cellular machinery actively
involved in making protein, namely the mRNA. The advantage
is that the cell has already edited out all the non-coding
DNA, and it is relatively easy to identify the protein-coding
portion of the RNA. The utility of this approach
was not immediately obvious to genomic researchers. In
fact, when cDNA sequencing was initially proposed, the
method was roundly denounced by those committed to genomic
sequencing. For example, the head of the U.S. Human Genome
project discounted cDNA sequencing as not valuable and
refused to approve funding of projects.

In this disclosure, we teach methods for analyzing
DNA, including cDNA libraries. Based on our analyses and
research, we see each individual gene product as a "pixel"
of information, which relates to the expression of that,
and only that, gene. We teach herein methods whereby the
individual "pixels" of gene expression information can be
combined into a single gene transcript "image," in which
each of the individual genes can be visualized
simultaneously, allowing relationships between the gene
pixels to be easily visualized and understood.
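
The "pixel" and "image" idea above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each sequenced clone from a library has already been assigned a gene identifier; the function name `transcript_image` and the gene names in the sample list are hypothetical and do not come from this disclosure. Each gene's share of the sequenced clones plays the role of one "pixel."

```python
from collections import Counter

def transcript_image(gene_ids):
    """Return {gene: fraction of all sequenced clones} for one library.

    gene_ids is a list with one gene identifier per sequenced clone
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter(gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical clone identifications from a single cDNA library.
library = ["actin", "actin", "EF-1a", "actin", "IL-8", "EF-1a"]
image = transcript_image(library)

# Print the "pixels" from most to least abundant.
for gene, share in sorted(image.items(), key=lambda kv: -kv[1]):
    print(f"{gene}: {share:.3f}")
```

Working with fractions rather than raw counts is one possible design choice; it lets libraries of different sizes be compared on the same scale.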

We further teach a new method which we call electronic
subtraction. Electronic subtraction will enable the gene
researcher to turn a single image into a moving picture,
one which describes the temporality or dynamics of gene
expression, at the level of a cell or a whole tissue. It
is that sense of "motion" of cellular machinery on the
scale of a cell or organ which constitutes the new
invention herein. This constitutes a new view into the
process of living cell physiology and one which holds great
promise to unveil and discover new therapeutic and
diagnostic approaches in medicine.

We teach another method which we call "electronic
northern," which tracks the expression of a single gene
across many types of cells and tissues.
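
As a rough sketch of the two ideas just named, the following assumes that each specimen is summarized as a mapping from gene to relative abundance. Treating electronic subtraction as a per-gene difference of abundances is an assumption made here for illustration only, not the procedure defined by this disclosure; the function names, specimen names, and values are likewise hypothetical.

```python
def electronic_subtraction(image_a, image_b):
    """Per-gene change in relative abundance, specimen A minus specimen B.

    Genes missing from one image are treated as having abundance 0.0.
    """
    genes = set(image_a) | set(image_b)
    return {g: image_a.get(g, 0.0) - image_b.get(g, 0.0) for g in genes}

def electronic_northern(gene, images):
    """Relative abundance of one gene in every named transcript image."""
    return {name: image.get(gene, 0.0) for name, image in images.items()}

# Hypothetical transcript images (gene -> fraction of clones) for two specimens.
images = {
    "resting":   {"actin": 0.50, "EF-1a": 0.40, "IL-8": 0.10},
    "activated": {"actin": 0.45, "EF-1a": 0.15, "IL-8": 0.40},
}

# Compare the two specimens gene by gene, largest change first.
diff = electronic_subtraction(images["activated"], images["resting"])
for gene, delta in sorted(diff.items(), key=lambda kv: -abs(kv[1])):
    print(f"{gene}: {delta:+.2f}")

# Follow a single gene across all specimens.
print(electronic_northern("IL-8", images))
```

Because the comparison is computed from stored images, either specimen can serve as the reference and further specimens can be added without new laboratory work, which is the practical appeal of doing the subtraction electronically.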

Nucleic acids (DNA and RNA) carry within their
sequence the hereditary information and are therefore the
prime molecules of life. Nucleic acids are found in all
living organisms including bacteria, fungi, viruses, plants
and animals. It is of interest to determine the relative
abundance of different discrete nucleic acids in different
cells, tissues and organisms over time under various
conditions, treatments and regimes.

All dividing cells in the human body contain the same
set of 23 pairs of chromosomes. It is estimated that these
autosomal and sex chromosomes encode approximately 100,000
genes. The differences among different types of cells are
believed to reflect the differential expression of the
100,000 or so genes. Fundamental questions of biology
could be answered by understanding which genes are
transcribed and knowing the relative abundance of
transcripts in different cells.




Previously, the art has only provided for the analysis
of a few known genes at a time by standard molecular
biology techniques such as PCR, northern blot analysis, or
other types of DNA probe analysis such as in situ
hybridization. Each of these methods allows one to analyze
the transcription of only known genes and/or small numbers
of genes at a time. Nucl. Acids Res. 19, 7097-7104 (1991);
Nucl. Acids Res. 18, 4833-42 (1990); Nucl. Acids Res. 18,
2789-92 (1989); European J. Neuroscience 2, 1063-1073
(1990); Analytical Biochem. 187, 364-73 (1990); Genet.
Annals Techn. Appl. 7, 64-70 (1990); GATA 8(4), 129-33
(1991); Proc. Natl. Acad. Sci. USA 85, 1696-1700 (1988);
Nucl. Acids Res. 19, 1954 (1991); Proc. Natl. Acad. Sci.
USA 88, 1943-47 (1991); Nucl. Acids Res. 19, 6123-27
(1991); Proc. Natl. Acad. Sci. USA 85, 5738-42 (1988);
Nucl. Acids Res. 16, 10937 (1988).

Studies of the number and types of genes whose
transcription is induced or otherwise regulated during cell
processes such as activation, differentiation, aging, viral
transformation, morphogenesis, and mitosis have been
pursued for many years, using a variety of methodologies.
One of the earliest methods was to isolate and analyze
levels of the proteins in a cell, tissue, organ system, or
even organisms both before and after the process of
interest. One method of analyzing multiple proteins in a
sample is using 2-dimensional gel electrophoresis, wherein
proteins can be, in principle, identified and quantified as
individual bands, and ultimately reduced to a discrete
signal. At present, 2-dimensional analysis only resolves
approximately 15% of the proteins. In order to positively
analyze those bands which are resolved, each band must be
excised from the membrane and subjected to protein sequence
analysis using Edman degradation. Unfortunately, most of
the bands were present in quantities too small to obtain a
reliable sequence, and many of those bands contained more
than one discrete protein. An additional difficulty is
that many of the proteins were blocked at the
amino-terminus, further complicating the sequencing
process.






Analyzing differentiation at the gene transcription
level has overcome many of these disadvantages and
drawbacks, since the power of recombinant DNA technology
allows amplification of signals containing very small
amounts of material. The most common method, called
"hybridization subtraction," involves isolation of mRNA
from the biological specimen before (B) and after (A) the
developmental process of interest, transcribing one set of
mRNA into cDNA, subtracting specimen B from specimen A
(mRNA from cDNA) by hybridization, and constructing a cDNA
library from the non-hybridizing mRNA fraction. Many
different groups have used this strategy successfully, and
a variety of procedures have been published and improved
upon using this same basic scheme. Nucl. Acids Res. 19,
7097-7104 (1991); Nucl. Acids Res. 18, 4833-42 (1990);
Nucl. Acids Res. 18, 2789-92 (1989); European J.
Neuroscience 2, 1063-1073 (1990); Analytical Biochem. 187,
364-73 (1990); Genet. Annals Techn. Appl. 7, 64-70 (1990);
GATA 8(4), 129-33 (1991); Proc. Natl. Acad. Sci. USA 85,
1696-1700 (1988); Nucl. Acids Res. 19, 1954 (1991); Proc.
Natl. Acad. Sci. USA 88, 1943-47 (1991); Nucl. Acids Res.
19, 6123-27 (1991); Proc. Natl. Acad. Sci. USA 85, 5738-42
(1988); Nucl. Acids Res. 16, 10937 (1988).

Although each of these techniques has particular
strengths and weaknesses, there are still some limitations
and undesirable aspects of these methods. First, the time
and effort required to construct such libraries is quite
large. Typically, a trained molecular biologist might
expect construction and characterization of such a library
to require 3 to 6 months, depending on the level of skill,
experience, and luck. Second, the resulting subtraction
libraries are typically inferior to the libraries
constructed by standard methodology. A typical
conventional cDNA library should have a clone complexity of
at least 10* clones, and an average insert size of 1-3 kB.
In contrast, subtracted libraries can have complexities of
10^ or 10^ and average insert sizes of 0.2 kB. Therefore,
there can be a significant loss of clone and sequence
information associated with such libraries. Third, this
approach allows the researcher to capture only the genes
induced in specimen A relative to specimen B, not
vice-versa, nor does it easily allow comparison to a third
specimen of interest (C). Fourth, this approach requires
very large amounts (hundreds of micrograms) of "driver"
mRNA (specimen B), which significantly limits the number
and type of subtractions that are possible since many
tissues and cells are very difficult to obtain in large
quantities.

Fifth, the resolution of the subtraction is dependent
upon the physical properties of DNA:DNA or RNA:DNA
hybridization. The ability of a given sequence to find a
hybridization match is dependent on its unique CoT value.
The CoT value is a function of the number of copies
(concentration) of the particular sequence, multiplied by
the time of hybridization. It follows that for sequences
which are abundant, hybridization events will occur very
rapidly (low CoT value), while rare sequences will form
duplexes at very high CoT values. CoT values which allow
such rare sequences to form duplexes and therefore be
effectively selected are difficult to achieve in a
convenient time frame. Therefore, hybridization
subtraction is simply not a useful technique with which to
study relative levels of rare mRNA species. Sixth, this
problem is further complicated by the fact that duplex
formation is also dependent on the nucleotide base
composition for a given sequence. Those sequences rich in
G + C form stronger duplexes than those with high contents
of A + T. Therefore, the former sequences will tend to be
removed selectively by hybridization subtraction. Seventh,
it is possible that hybridization between nonexact matches
can occur. When this happens, the expression of a
homologous gene may "mask" expression of a gene of
interest, artificially skewing the results for that
particular gene.
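
The dependence on CoT described above can be made concrete with a
small numerical sketch. It assumes ideal second-order reassociation
kinetics, in which the fraction of a sequence still single-stranded is
1 / (1 + k * C0 * t). That model, the rate constant k, and the
concentrations used are illustrative assumptions and are not taken from
the text.

```python
def fraction_single_stranded(c0: float, t: float, k: float = 1.0) -> float:
    """Fraction of a sequence still unhybridized after time t.

    Assumes ideal second-order reassociation kinetics (an assumption,
    not stated in the text): C / C0 = 1 / (1 + k * C0 * t).
    """
    return 1.0 / (1.0 + k * c0 * t)


def time_to_half_reassociation(c0: float, k: float = 1.0) -> float:
    """Time at which half of the sequence has formed duplexes (C0 * t = 1 / k)."""
    return 1.0 / (k * c0)


# Hypothetical concentrations in arbitrary units: an abundant and a rare mRNA.
abundant, rare = 1.0, 0.001

# Under this model a 1000-fold rarer sequence needs about 1000-fold more time
# to reach the same CoT value.
ratio = time_to_half_reassociation(rare) / time_to_half_reassociation(abundant)
```

Under these assumptions the required hybridization time scales inversely
with concentration, which is why rare mRNA species are poorly resolved.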

Matsubara and Okubo proposed using partial cDNA
sequences to establish expression profiles of genes which
could be used in functional analyses of the human genome.
Matsubara and Okubo warned against using random priming, as
it creates multiple unique DNA fragments from individual
mRNAs and may thus skew the analysis of the number of
particular mRNAs per library. They sequenced randomly
selected members from a 3'-directed cDNA library and
established the frequency of appearance of the various
ESTs. They proposed comparing lists of ESTs from various
cell types to classify genes. Genes expressed in many
different cell types were labeled housekeepers and those
selectively expressed in certain cells were labeled
cell-specific genes, even in the absence of the full sequence of
the gene or the biological activity of the gene product.

The present invention avoids the drawbacks of the
prior art by providing a method to quantify the relative
abundance of multiple gene transcripts in a given
biological specimen by the use of high-throughput
sequence-specific analysis of individual RNAs and/or their
corresponding cDNAs.

The present invention offers several advantages over
current protein discovery methods which attempt to isolate
individual proteins based upon biological effects. The
method of the instant invention provides for detailed
diagnostic comparisons of cell profiles revealing numerous
changes in the expression of individual transcripts.

The instant invention provides several advantages over
current subtraction methods, including a more complex
library analysis (10^ to 10^ clones as compared to 10^
clones) which allows identification of low abundance
messages as well as enabling the identification of messages
which either increase or decrease in abundance. These
large libraries are very routine to make in contrast to the
libraries of previous methods. In addition, homologues can
easily be distinguished with the method of the instant
invention.

This method is very convenient because it organizes a
large quantity of data into a comprehensible, digestible
format. The most significant differences are highlighted
by electronic subtraction. In-depth analyses are made more
convenient.






The present invention provides several advantages over
previous methods of electronic analysis of cDNA. The
method is particularly powerful when more than 100 and
preferably more than 1,000 gene transcripts are analyzed.
In such a case, new low-frequency transcripts are
discovered and tissue typed.

High resolution analysis of gene expression can be
used directly as a diagnostic profile or to identify
disease-specific genes for the development of more classic
diagnostic approaches.

This process is defined as gene transcript frequency 
analysis. The resulting quantitative analysis of the gene 
transcripts is defined as comparative gene transcript 
analysis. 

3. SUMMARY OF THE INVENTION

The invention is a method of analyzing a specimen
containing gene transcripts comprising the steps of (a)
producing a library of biological sequences; (b) generating
a set of transcript sequences, where each of the transcript
sequences in said set is indicative of a different one of
the biological sequences of the library; (c) processing the
transcript sequences in a programmed computer (in which a
database of reference transcript sequences indicative of
reference sequences is stored), to generate an identified
sequence value for each of the transcript sequences, where
each said identified sequence value is indicative of
sequence annotation and a degree of match between one of
the biological sequences of the library and at least one of
the reference sequences; and (d) processing each said
identified sequence value to generate final data values
indicative of the number of times each identified sequence
value is present in the library.
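
Steps (c) and (d) can be illustrated with a minimal sketch. The
ungapped identity score, the 0.8 threshold, the reference names and the
sequences are all hypothetical choices made for illustration; the text
does not specify the comparison algorithm, and this is not the program
of Table 5.

```python
from collections import Counter
from typing import Dict, NamedTuple


class IdentifiedSequence(NamedTuple):
    annotation: str  # name of the best-matching reference (hypothetical names)
    match: str       # "exact", "similar", or "none"


def identity(a: str, b: str) -> float:
    """Fraction of identical positions over the shorter sequence (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def identify(seq: str, reference: Dict[str, str],
             min_identity: float = 0.8) -> IdentifiedSequence:
    """Step (c): assign an annotation and a degree of match to one sequence."""
    best_name, best_score = None, 0.0
    for name, ref in reference.items():
        score = identity(seq, ref)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score == 1.0:
        return IdentifiedSequence(best_name, "exact")
    if best_name is not None and best_score >= min_identity:
        return IdentifiedSequence(best_name, "similar")
    return IdentifiedSequence("unidentified", "none")


# Toy reference database and library (illustrative data only).
reference = {"REF_A": "ACGTACGTAC", "REF_B": "TTTTGGGGCC"}
library = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "GAGAGAGAGA"]

# Step (d): count how many times each identified sequence value occurs.
counts = Counter(identify(s, reference) for s in library)
```

A real implementation would replace `identity` with a proper alignment
search against a sequence data bank; only the counting structure is the
point here.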

The invention also includes a method of comparing two
specimens containing gene transcripts. The first specimen
is processed as described above. The second specimen is
used to produce a second library of biological sequences,
which is used to generate a second set of transcript
sequences, where each of the transcript sequences in the




In a further embodiment, the relative abundance of the
gene transcripts in one cell type or tissue is compared
with the relative abundance of gene transcript numbers in a
second cell type or tissue in order to identify the
differences and similarities.

In a further embodiment, the method includes a system
for analyzing a library of biological sequences including a
means for receiving a set of transcript sequences, where
each of the transcript sequences is indicative of a
different one of the biological sequences of the library;
and a means for processing the transcript sequences in a
computer system in which a database of reference transcript
sequences indicative of reference sequences is stored,
wherein the computer is programmed with software for
generating an identified sequence value for each of the
transcript sequences, where each said identified sequence
value is indicative of a sequence annotation and the degree
of match between a different one of the biological
sequences of the library and at least one of the reference
sequences, and for processing each said identified sequence
value to generate final data values indicative of the
number of times each identified sequence value is present
in the library.

In essence, the invention is a method and system for
quantifying the relative abundance of gene transcripts in a
biological specimen. The invention provides a method for
comparing the gene transcript image from two or more
different biological specimens in order to distinguish
between the two specimens and identify one or more genes
which are differentially expressed between the two
specimens. Thus, this gene transcript image and its
comparison can be used as a diagnostic. One embodiment of
the method generates high-throughput sequence-specific
analysis of multiple RNAs or their corresponding cDNAs: a
gene transcript image. Another embodiment of the method
produces the gene transcript imaging analysis by the use of
high-throughput cDNA sequence analysis. In addition, two
or more gene transcript images can be compared and used to
detect or diagnose a particular biological state, disease,
or condition which is correlated to the relative abundance
of gene transcripts in a given cell or population of cells.

4. DESCRIPTION OF THE TABLES AND DRAWINGS

4.1. TABLES 

Table 1 presents a detailed explanation of the letter
codes utilized in Tables 2-5.

Table 2 lists the one hundred most common gene
transcripts. It is a partial list of isolates from the
HUVEC cDNA library prepared and sequenced as described
below. The left-hand column refers to the sequence's order
of abundance in this table. The next column labeled
"number" is the clone number of the first HUVEC sequence
identification reference matching the sequence in the
"entry" column number. Isolates that have not been
sequenced are not present in Table 2. The next column,
labeled "N", indicates the total number of cDNAs which have
the same degree of match with the sequence of the reference
transcript in the "entry" column.

The column labeled "entry" gives the NIH GENBANK locus
name, which corresponds to the library sequence numbers.
The "s" column indicates in a few cases the species of the
reference sequence. The code for column "s" is given in
Table 1. The column labeled "descriptor" provides a plain
English explanation of the identity of the sequence
corresponding to the NIH GENBANK locus name in the "entry"
column.

Table 3 is a comparison of the top fifteen most 
abundant gene transcripts in normal monocytes and activated 
macrophage cells. 

Table 4 is a detailed summary of library subtraction
analysis comparing the THP-1 and human macrophage
cDNA sequences. In Table 4, the same code as in Table 2 is
used. Additional columns are for "bgfreq" (abundance
number in the subtractant library), "rfend" (abundance
number in the target library) and "ratio" (the target
abundance number divided by the subtractant abundance
number). As is clear from perusal of the table, when the
abundance number in the subtractant library is "0", the
target abundance number is divided by 0.05. This is a way
of obtaining a result (not possible dividing by 0) and
distinguishing the result from ratios of subtractant
numbers of 1.
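
The "ratio" column rule can be sketched as follows. The function name
and the example counts are hypothetical; only the column names
("bgfreq", "rfend", "ratio") and the 0.05 substitution come from the
description above, and this is not the program listed in Table 5.

```python
def subtraction_ratio(rfend: int, bgfreq: int,
                      zero_substitute: float = 0.05) -> float:
    """Target abundance divided by subtractant abundance.

    When the subtractant abundance is 0, divide by 0.05 instead, so that a
    result exists and differs from the result for a subtractant count of 1.
    """
    denominator = bgfreq if bgfreq > 0 else zero_substitute
    return rfend / denominator


# Hypothetical rows: the same target count against three subtractant counts.
rows = [
    {"rfend": 3, "bgfreq": 0},
    {"rfend": 3, "bgfreq": 1},
    {"rfend": 3, "bgfreq": 6},
]
for row in rows:
    row["ratio"] = subtraction_ratio(row["rfend"], row["bgfreq"])
```

With these counts, a transcript absent from the subtractant library gets
a ratio twenty times larger than one seen there once, which keeps the
two cases apart when the table is sorted by ratio.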

Table 5 is the computer program, written in source
code, for generating gene transcript subtraction profiles.

Table 6 is a partial listing of database entries used
in the electronic northern blot analysis as provided by the
present invention.

4.2. BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a chart summarizing data collected and 
stored regarding the library construction portion of 
sequence preparation and analysis. 

Figure 2 is a diagram representing the sequence of
operations performed by "abundance sort" software in a
class of preferred embodiments of the inventive method.

Figure 3 is a block diagram of a preferred embodiment 
of the system of the invention. 

Figure 4 is a more detailed block diagram of the
bioinformatics process from new sequence (that has already
been sequenced but not identified) to printout of the
transcript imaging analysis and the provision of database
subscriptions.

5. DETAILED DESCRIPTION OF THE INVENTION

The present invention provides a method to compare the
relative abundance of gene transcripts in different
biological specimens by the use of high-throughput
sequence-specific analysis of individual RNAs or their
corresponding cDNAs (or alternatively, of data representing
other biological sequences). This process is denoted
herein as gene transcript imaging. The quantitative
analysis of the relative abundance for a set of gene
transcripts is denoted herein as "gene transcript image
analysis" or "gene transcript frequency analysis". The
present invention allows one to obtain a profile for gene
transcription in any given population of cells or tissue
from any type of organism. The invention can be applied to
obtain a profile of a specimen consisting of a single cell
(or clones of a single cell), or of many cells, or of
tissue more complex than a single cell and containing
multiple cell types, such as liver.

The invention has significant advantages in the fields
of diagnostics, toxicology and pharmacology, to name a few.
A highly sophisticated diagnostic test can be performed on
the ill patient in whom a diagnosis has not been made. A
biological specimen consisting of the patient's fluids or
tissue is obtained, and gene transcripts are isolated
and expanded to the extent necessary to determine their
identity. Optionally, the gene transcripts can be
converted to cDNA. A sampling of the gene transcripts is
subjected to sequence-specific analysis and quantified.
These gene transcript sequence abundances are compared
against reference database sequence abundances including
normal data sets for diseased and healthy patients. The
patient has the disease(s) with which the patient's data
set most closely correlates.
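
One way to picture "most closely correlates" is a correlation score
between the patient's abundance profile and each reference data set.
The sketch below uses the Pearson coefficient, which is an assumption
made for illustration (the text does not name a statistic), and the gene
names, data set names and values are hypothetical.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def closest_profile(patient: Dict[str, float],
                    references: Dict[str, Dict[str, float]]) -> str:
    """Name of the reference data set that correlates best with the patient."""
    genes = sorted(set(patient).union(*references.values()))

    def vec(profile: Dict[str, float]) -> List[float]:
        # Transcripts missing from a profile count as zero abundance.
        return [profile.get(g, 0.0) for g in genes]

    return max(references,
               key=lambda name: pearson(vec(patient), vec(references[name])))


# Hypothetical relative abundances (fractions of sequenced clones).
patient = {"gene_a": 0.30, "gene_b": 0.05, "gene_c": 0.01}
references = {
    "healthy": {"gene_a": 0.05, "gene_b": 0.30, "gene_c": 0.02},
    "disease_X": {"gene_a": 0.28, "gene_b": 0.06, "gene_c": 0.01},
}
best = closest_profile(patient, references)
```

Any similarity measure could be substituted for `pearson`; the structure
of comparing one profile against a set of stored reference profiles is
what the paragraph describes.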

For example, gene transcript frequency analysis can be
used to differentiate normal cells or tissues from diseased
cells or tissues, just as it highlights differences between
normal monocytes and activated macrophages in Table 3.

In toxicology, a fundamental question is which tests
are most effective in predicting or detecting a toxic
effect. Gene transcript imaging provides highly detailed
information on the cell and tissue environment, some of
which would not be obvious in conventional, less detailed
screening methods. The gene transcript image is a more
powerful method to predict drug toxicity and efficacy.
Similar benefits accrue in the use of this tool in
pharmacology. The gene transcript image can be used
selectively to look at protein categories which are
expected to be affected, for example, enzymes which
detoxify toxins.

In an alternative embodiment, comparative gene
transcript frequency analysis is used to differentiate
between cancer cells which respond to anti-cancer agents
and those which do not respond. Examples of anti-cancer
agents are tamoxifen, vincristine, vinblastine,
podophyllotoxins, etoposide, teniposide, cisplatin,
biologic response modifiers such as interferon, IL-2,
GM-CSF, enzymes, hormones and the like. This method also
provides a means for sorting the gene transcripts by
functional category. In the case of cancer cells,
transcription factors or other essential regulatory
molecules are very important categories to analyze across
different libraries.

In yet another embodiment, comparative gene transcript
frequency analysis is used to differentiate between control
liver cells and liver cells isolated from patients treated
with experimental drugs like FIAU to distinguish between
pathology caused by the underlying disease and that caused
by the drug.

In yet another embodiment, comparative gene transcript 
frequency analysis is used to differentiate between brain 
tissue from patients treated and untreated with lithium. 

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between
cyclosporin and FK506-treated cells and normal cells.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between virally
infected (including HIV-infected) human cells and
uninfected human cells. Gene transcript frequency analysis
is also used to rapidly survey gene transcripts in
HIV-resistant, HIV-infected, and HIV-sensitive cells.
Comparison of gene transcript abundance will indicate the
success of treatment and/or new avenues to study.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between
bronchial lavage fluids from healthy and unhealthy patients
with a variety of ailments.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between cell,
plant, microbial and animal mutants and wild-type species.
In addition, the transcript abundance program is adapted to
permit the scientist to evaluate the transcription of one
gene in many different tissues. Such comparisons could
identify deletion mutants which do not produce a gene
product and point mutants which produce a less abundant or
otherwise different message. Such mutations can affect
basic biochemical and pharmacological processes, such as
mineral nutrition and metabolism, and can be isolated by
means known to those skilled in the art. Thus, crops with
improved yields, pest resistance and other factors can be
developed.

In a further embodiment, comparative gene transcript
frequency analysis is used for an interspecies comparative
analysis which would allow for the selection of better
pharmacologic animal models. In this embodiment, humans
and other animals (such as a mouse), or their cultured
cells, are treated with a specific test agent. The relative
sequence abundance of each cDNA population is determined.
If the animal test system is a good model, homologous genes
in the animal cDNA population should change expression
similarly to those in human cells. If side effects are
detected with the drug, a detailed transcript abundance
analysis will be performed to survey gene transcript
changes. Models will then be evaluated by comparing basic
physiological changes.

In a further embodiment, comparative gene transcript
frequency analysis is used in a clinical setting to give a
highly detailed gene transcript profile of a patient's
cells or tissue (for example, a blood sample). In
particular, gene transcript frequency analysis is used to
give a high resolution gene expression profile of a
diseased state or condition.

In the preferred embodiment, the method utilizes
high-throughput cDNA sequencing to identify specific
transcripts of interest. The generated cDNA and deduced
amino acid sequences are then extensively compared with
GENBANK and other sequence data banks as described below.
The method offers several advantages over current protein
discovery by two-dimensional gel methods which try to
identify individual proteins involved in a particular
biological effect. Here, detailed comparisons of profiles
of activated and inactive cells reveal numerous changes in
the expression of individual transcripts. After it is
determined if the sequence is an "exact" match, similar or
a non-match, the sequence is entered into a database.
Next, the numbers of copies of cDNA corresponding to each
gene are tabulated. Although this can be done slowly and
arduously, if at all, by human hand from a printout of all
entries, a computer program is a useful and rapid way to
tabulate this information. The numbers of cDNA copies
(optionally divided by the total number of sequences in the
data set) provide a picture of the relative abundance of
transcripts for each corresponding gene. The list of
represented genes can then be sorted by abundance in the
cDNA population. A multitude of additional types of
comparisons or dimensions are possible and are exemplified
below.
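
The tabulation and sorting step lends itself to a short sketch. The
function name and the gene labels are hypothetical, and the code is an
illustration of the idea rather than the "abundance sort" software of
Figure 2.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def transcript_image(gene_calls: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Tabulate clone counts per gene and sort by abundance.

    Returns (gene, copies, relative_abundance) rows, most abundant first;
    ties are broken alphabetically so the order is reproducible.
    """
    counts = Counter(gene_calls)
    total = sum(counts.values())
    rows = [(gene, n, n / total) for gene, n in counts.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical identifications for ten sequenced clones.
calls = ["gene_a"] * 5 + ["gene_b"] * 3 + ["gene_c"] * 2
image = transcript_image(calls)
```

Dividing by the total number of sequences makes images from libraries of
different sizes comparable, which is what later inter-library
comparisons rely on.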

An alternate method of producing a gene transcript
image includes the steps of obtaining a mixture of test
mRNA and providing a representative array of unique probes
whose sequences are complementary to at least some of the
test mRNAs. Next, a fixed amount of the test mRNA is added
to the arrayed probes. The test mRNA is incubated with the
probes for a sufficient time to allow hybrids of the test
mRNA and probes to form. The mRNA-probe hybrids are
detected and the quantity determined. The hybrids are
identified by their location in the probe array. The
quantity of each hybrid is summed to give a population
number. Each hybrid quantity is divided by the population
number to provide a set of relative abundance data termed a
gene transcript image analysis.
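
The normalization in the array-based method is simple arithmetic, shown
here as a minimal sketch. Array positions are represented as (row,
column) tuples and the signal values are invented for illustration; how
hybrid quantities are actually measured is outside this example.

```python
from typing import Dict, Tuple

Position = Tuple[int, int]  # (row, column) of a probe in the array


def array_transcript_image(hybrid_quantities: Dict[Position, float]) -> Dict[Position, float]:
    """Divide each hybrid quantity by the summed population number."""
    population = sum(hybrid_quantities.values())
    if population == 0:
        raise ValueError("no hybrids detected; relative abundance is undefined")
    return {pos: qty / population for pos, qty in hybrid_quantities.items()}


# Hypothetical detected quantities at three probe locations.
quantities = {(0, 0): 120.0, (0, 1): 60.0, (1, 0): 20.0}
relative = array_transcript_image(quantities)
```

Because each probe location identifies one sequence, the resulting
dictionary plays the same role as the sorted clone counts produced by
the sequencing-based method.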

6. EXAMPLES

The examples below are provided to illustrate the 
subject invention. These examples are provided by way of 
illustration and are not included for the purpose of 
limiting the invention. 

6.1. TISSUE SOURCES AND CELL LINES

For analysis with the computer program claimed herein,
biological sequences can be obtained from virtually any
source. Most popular are tissues obtained from the human
body. Tissues can be obtained from any organ of the body,
any age donor, any abnormality or any immortalized cell
line. Immortal cell lines may be preferred in some
instances because of their purity of cell type; other
tissue samples invariably include mixed cell types. A
special technique is available to take a single cell (for
example, a brain cell) and harness the cellular machinery
to grow up sufficient cDNA for sequencing by the techniques
and analysis described herein (cf. U.S. Patent Nos.
5,021,335 and 5,168,038, which are incorporated by
reference). The examples given herein utilized the
following immortalized cell lines: monocyte-like U-937
cells, activated macrophage-like THP-1 cells, induced
vascular endothelial cells (HUVEC cells) and mast cell-like
HMC-1 cells.

The U-937 cell line is a human histiocytic lymphoma
cell line with monocyte characteristics, established from
malignant cells obtained from the pleural effusion of a
patient with diffuse histiocytic lymphoma (Sundstrom, C.
and Nilsson, K. (1976) Int. J. Cancer 17:565). U-937 is
one of only a few human cell lines with the morphology,
cytochemistry, surface receptors and monocyte-like
characteristics of histiocytic cells. These cells can be
induced to terminal monocytic differentiation and will
express new cell surface molecules when activated with
supernatants from human mixed lymphocyte cultures. Upon
this type of in vitro activation, the cells undergo
morphological and functional changes, including
augmentation of antibody-dependent cellular cytotoxicity
(ADCC) against erythroid and tumor target cells (one of the
principal functions of macrophages). Activation of U-937
cells with phorbol 12-myristate 13-acetate (PMA) in vitro
stimulates the production of several compounds, including
prostaglandins, leukotrienes and platelet-activating factor
(PAF), which are potent inflammatory mediators. Thus,
U-937 is a cell line that is well suited for the
identification and isolation of gene transcripts associated
with normal monocytes.


The HUVEC cell line is a normal, homogeneous, well
characterized, early passage endothelial cell culture from
human umbilical vein (Cell Systems Corp., 12815 NE 124th
Street, Kirkland, WA 98034). Only gene transcripts from
induced, or treated, HUVEC cells were sequenced. One batch
of 1 X 10^ cells was treated for 5 hours with 1 U/ml rIL-1b
and 100 ng/ml E. coli lipopolysaccharide (LPS) endotoxin
prior to harvesting. A separate batch of 2 X 10^ cells was
treated at confluence with 4 U/ml TNF and 2 U/ml
interferon-gamma (IFN-gamma) prior to harvesting.

THP-1 is a human leukemic cell line with distinct
monocytic characteristics. This cell line was derived from
the blood of a 1-year-old boy with acute monocytic leukemia
(Tsuchiya, S. et al. (1980) Int. J. Cancer: 171-76). The
following cytological and cytochemical criteria were used
to determine the monocytic nature of the cell line: 1) the
presence of alpha-naphthyl butyrate esterase activity which
could be inhibited by sodium fluoride; 2) the production of
lysozyme; 3) the phagocytosis of latex particles and
sensitized SRBC (sheep red blood cells); and 4) the ability
of mitomycin C-treated THP-1 cells to activate
T-lymphocytes following ConA (concanavalin A) treatment.
Morphologically, the cytoplasm contained small azurophilic
granules and the nucleus was indented and irregularly
shaped with deep folds. The cell line had Fc and C3b
receptors, probably functioning in phagocytosis. THP-1
cells treated with the tumor promoter
12-O-tetradecanoyl-phorbol-13-acetate (TPA) stop proliferating and
differentiate into macrophage-like cells which mimic native
monocyte-derived macrophages in several respects.
Morphologically, as the cells change shape, the nucleus
becomes more irregular and additional phagocytic vacuoles
appear in the cytoplasm. The differentiated THP-1 cells
also exhibit an increased adherence to tissue culture
plastic.

HMC-1 cells (a human mast cell line) were established
from the peripheral blood of a Mayo Clinic patient with
mast cell leukemia (Leukemia Res. (1988) 12:345-55). The
cultured cells looked similar to immature cloned murine
mast cells, contained histamine, and stained positively for
chloroacetate esterase, amino caproate esterase, eosinophil
major basic protein (MBP) and tryptase. The HMC-1 cells
have, however, lost the ability to synthesize normal IgE
receptors. HMC-1 cells also possess a 10;16 translocation,
present in cells initially collected by leukapheresis from
the patient and not an artifact of culturing. Thus, HMC-1
cells are a good model for mast cells.

6.2. CONSTRUCTION OF cDNA LIBRARIES

For inter-library comparisons, the libraries must be
prepared in similar manners. Certain parameters appear to
be particularly important to control. One such parameter
is the method of isolating mRNA. It is important to use
the same conditions to remove DNA and heterogeneous nuclear
RNA from comparison libraries. Size fractionation of cDNA
must be carefully controlled. The same vector preferably
should be used for preparing libraries to be compared. At
the very least, the same type of vector (e.g.,
unidirectional vector) should be used to assure a valid
comparison. A unidirectional vector may be preferred in
order to more easily analyze the output.

It is preferred to prime only with oligo dT
unidirectional primer in order to obtain only one clone per
mRNA transcript when obtaining cDNAs. However, it is
recognized that employing a mixture of oligo dT and random
primers can also be advantageous because such a mixture
results in more sequence diversity when gene discovery also
is a goal. Similar effects can be obtained with DR2
(Clontech) and HXLOX (US Biochemical) and also vectors from
Invitrogen and Novagen. These vectors have two
requirements. First, there must be primer sites for
commercially available primers such as T3 or M13 reverse
primers. Second, the vector must accept inserts up to 10
kB.

It also is important that the clones be randomly
sampled, and that a significant population of clones is
used. Data have been generated with 5,000 clones; however,
if very rare genes are to be obtained and/or their relative
abundance determined, as many as 100,000 clones from a
single library may need to be sampled. Size fractionation
of cDNA also must be carefully controlled. Alternately,
plaques can be selected, rather than clones.
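
The effect of sample size on rare transcripts can be estimated with a
short calculation. It assumes that clones are drawn independently and at
random, so that the chance of seeing a transcript of frequency f at
least once among n clones is 1 - (1 - f)^n. The frequency of 1 in 50,000
is a hypothetical value chosen for illustration.

```python
def prob_at_least_one(frequency: float, clones_sampled: int) -> float:
    """Probability that a transcript appears at least once in the sample.

    Assumes independent random sampling of clones (binomial model).
    """
    return 1.0 - (1.0 - frequency) ** clones_sampled


rare_frequency = 1 / 50_000  # hypothetical rare transcript

p_5k = prob_at_least_one(rare_frequency, 5_000)      # roughly 0.10
p_100k = prob_at_least_one(rare_frequency, 100_000)  # roughly 0.86
```

Under these assumptions a 5,000-clone sample would usually miss such a
transcript, while a 100,000-clone sample would usually contain it at
least once.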

Besides the Uni-ZAP™ vector system by Stratagene
disclosed below, it is now believed that other similarly
unidirectional vectors also can be used. For example, it
is believed that such vectors include but are not limited
to DR2 (Clontech), and HXLOX (U.S. Biochemical).

Preferably, the details of library construction (as
shown in Figure 1) are collected and stored in a database
for later retrieval relative to the sequences being
compared. Fig. 1 shows important information regarding the
library collaborator or cell or cDNA supplier,
pretreatment, biological source, culture, mRNA preparation
and cDNA construction. Similarly detailed information
about the other steps is beneficial in analyzing sequences
and libraries in depth.

RNA must be harvested from cells and tissue samples
and cDNA libraries are subsequently constructed. cDNA
libraries can be constructed according to techniques known
in the art. (See, for example, Maniatis, T. et al. (1982)
Molecular Cloning, Cold Spring Harbor Laboratory, New
York). cDNA libraries may also be purchased. The U-937
cDNA library (catalog No. 937207) was obtained from
Stratagene, Inc., 11099 N. Torrey Pines Rd., La Jolla, CA
92037.

The THP-1 cDNA library was custom constructed by
Stratagene from THP-1 cells cultured 48 hours with 100 nM
TPA and 4 hours with 1 µg/ml LPS. The human mast cell
HMC-1 cDNA library was also custom constructed by Stratagene
from cultured HMC-1 cells. The HUVEC cDNA library was
custom constructed by Stratagene from two batches of
induced HUVEC cells which were separately processed.

Essentially, all the libraries were prepared in the
same manner. First, poly(A+) RNA (mRNA) was purified. For
the U-937 and HMC-1 RNA, cDNA synthesis was only primed
with oligo dT. For the THP-1 and HUVEC RNA, cDNA synthesis
was primed separately with both oligo dT and random
hexamers, and the two cDNA libraries were treated
separately. Synthetic adaptor oligonucleotides were
ligated onto cDNA ends enabling its insertion into the
Uni-ZAP™ vector system (Stratagene), allowing high efficiency
unidirectional (sense orientation) lambda library
construction and the convenience of a plasmid system with
blue-white color selection to detect clones with cDNA
insertions. Finally, the two libraries were combined into
a single library by mixing equal numbers of bacteriophage.

The libraries can be screened with either DNA probes
or antibody probes and the pBluescript® phagemid
(Stratagene) can be rapidly excised in vivo. The phagemid
allows the use of a plasmid system for easy insert
characterization, sequencing, site-directed mutagenesis,
the creation of unidirectional deletions and expression of
fusion proteins. The custom-constructed library phage
particles were infected into E. coli host strain XL1-Blue®
(Stratagene), which has a high transformation efficiency,
increasing the probability of obtaining rare,
under-represented clones in the cDNA library.

6.3. ISOLATION OF cDNA CLONES

The phagemid forms of individual cDNA clones were
obtained by the in vivo excision process, in which the host
bacterial strain was coinfected with both the lambda
library phage and an f1 helper phage. Proteins derived
from both the library-containing phage and the helper phage
nicked the lambda DNA, initiated new DNA synthesis from
defined sequences on the lambda target DNA and created a
smaller, single stranded circular phagemid DNA molecule
that included all DNA sequences of the pBluescript® plasmid
and the cDNA insert. The phagemid DNA was secreted from
the cells and purified, then used to re-infect fresh host
cells, where the double stranded phagemid DNA was produced.
Because the phagemid carries the gene for beta-lactamase,
the newly-transformed bacteria are selected on medium
containing ampicillin.

Phagemid DNA was purified using the Magic Minipreps™
DNA Purification System (Promega catalogue #A7100, Promega
Corp., 2800 Woods Hollow Rd., Madison, WI 53711). This
small-scale process provides a simple and reliable method
for lysing the bacterial cells and rapidly isolating
purified phagemid DNA using a proprietary DNA-binding
resin. The DNA was eluted from the purification resin
already prepared for DNA sequencing and other analytical
manipulations.

Phagemid DNA was also purified using the QIAwell-8
Plasmid Purification System from QIAGEN® DNA Purification
System (QIAGEN Inc., 9259 Eton Ave., Chatsworth, CA
91311). This product line provides a convenient, rapid and
reliable high-throughput method for lysing the bacterial
cells and isolating highly purified phagemid DNA using
QIAGEN anion-exchange resin particles with EMPORE™ membrane
technology from 3M in a multiwell format. The DNA was
eluted from the purification resin already prepared for DNA
sequencing and other analytical manipulations.

An alternate method of purifying phagemid has recently
become available. It utilizes the Miniprep Kit (Catalog
No. 77468, available from Advanced Genetic Technologies
Corp., 19212 Orbit Drive, Gaithersburg, Maryland). This
kit is in the 96-well format and provides enough reagents
for 960 purifications. Each kit is provided with a
recommended protocol, which has been employed except for
the following changes. First, the 96 wells are each filled
with only 1 ml of sterile terrific broth with carbenicillin
at 25 mg/L and glycerol at 0.4%. After the wells are
inoculated, the bacteria are cultured for 24 hours and
lysed with 60 µl of lysis buffer. A centrifugation step
(2900 rpm for 5 minutes) is performed before the contents
of the block are added to the primary filter plate. The
optional step of adding isopropanol to TRIS buffer is not
routinely performed. After the last step in the protocol,
samples are transferred to a Beckman 96-well block for
storage.

Another new DNA purification system is the WIZARD™ 
product line which is available from Promega (catalog No. 
A7071) and may be adaptable to the 96-well format. 






6.4. SEQUENCING OF cDNA CLONES

The cDNA inserts from random isolates of the U-937 and
THP-1 libraries were sequenced in part. Methods for DNA
sequencing are well known in the art. Conventional
enzymatic methods employ DNA polymerase Klenow fragment,
Sequenase™ or Taq polymerase to extend DNA chains from an
oligonucleotide primer annealed to the DNA template of
interest. Methods have been developed for the use of both
single- and double-stranded templates. The chain
termination reaction products are usually electrophoresed
on urea-acrylamide gels and are detected either by
autoradiography (for radionuclide-labeled precursors) or by
fluorescence (for fluorescent-labeled precursors). Recent
improvements in mechanized reaction preparation, sequencing
and analysis using the fluorescent detection method have
permitted expansion in the number of sequences that can be
determined per day (such as the Applied Biosystems 373 and
377 DNA sequencer, Catalyst 800). Currently with the
system as described, read lengths range from 250 to 400
bases and are clone dependent. Read length also varies
with the length of time the gel is run. In general, the
shorter runs tend to truncate the sequence. A minimum of
only about 25 to 50 bases is necessary to establish the
identification and degree of homology of the sequence.

Gene transcript imaging can be used with any
sequence-specific method, including, but not limited to,
hybridization, mass spectroscopy, capillary electrophoresis
and 505 gel electrophoresis.

6.5. HOMOLOGY SEARCHING OF cDNA CLONE AND
DEDUCED PROTEIN (and Subsequent Steps)

Using the nucleotide sequences derived from the cDNA
clones as query sequences (sequences of a Sequence
Listing), databases containing previously identified
sequences are searched for areas of homology (similarity).
Examples of such databases include Genbank and EMBL. We
next describe examples of two homology search algorithms
that can be used, and then describe the subsequent
computer-implemented steps to be performed in accordance
with preferred embodiments of the invention.






In the following description of the computer-
implemented steps of the invention, the word "library"
denotes a set (or population) of biological specimen
nucleic acid sequences. A "library" can consist of cDNA
sequences, RNA sequences, or the like, which characterize a
biological specimen. The biological specimen can consist
of cells of a single human cell type (or can be any of the
other above-mentioned types of specimens). We contemplate
that the sequences in a library have been determined so as
to accurately represent or characterize a biological
specimen (for example, they can consist of representative
cDNA sequences from clones of RNA taken from a single human
cell).

In the following description of the computer-
implemented steps of the invention, the expression
"database" denotes a set of stored data which represent a
collection of sequences, which in turn represent a
collection of biological reference materials. For example,
a database can consist of data representing many stored
cDNA sequences which are in turn representative of human
cells infected with various viruses, cells of humans of
various ages, cells from different mammalian species, and
so on.

In preferred embodiments, the invention employs a
computer programmed with software (to be described) for
performing the following steps:

(a) processing data indicative of a library of cDNA
sequences (generated as a result of high-throughput cDNA
sequencing or other method) to determine whether each
sequence in the library matches a DNA sequence of a
reference database of DNA sequences (and if so, identifying
the reference database entry which matches the sequence and
indicating the degree of match between the reference
sequence and the library sequence) and assigning an
identified sequence value based on the sequence annotation
and degree of match to each of the sequences in the
library;

(b) for some or all entries of the database,
tabulating the number of matching identified sequence
values in the library (although this can be done by human
hand from a printout of all entries, we prefer to perform
this step using computer software to be described below),
thereby generating a set of final data values or "abundance
numbers"; and

(c) if the libraries are different sizes, dividing
each abundance number by the total number of sequences in
the library, to obtain a relative abundance number for each
identified sequence value (i.e., a relative abundance of
each gene transcript).

The list of identified sequence values (or genes 
corresponding thereto) can then be sorted by abundance in 
the cDNA population. A multitude of additional types of 
comparisons or dimensions are possible. 
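
Steps (b) and (c) can be sketched in a few lines of Python. This is only a minimal illustration and not the Table 5 program: the function names and the sample identified sequence values are hypothetical, and each library is assumed to arrive as a plain list with one identified sequence value per sequenced clone.

```python
from collections import Counter


def abundance_numbers(identified_values):
    """Step (b): count how many library sequences share each identified value."""
    return Counter(identified_values)


def relative_abundance(counts):
    """Step (c): divide each abundance number by the library size."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Hypothetical output of step (a): one identified sequence value per clone.
library = ["gene_a", "gene_a", "gene_b", "gene_a", "no match"]

counts = abundance_numbers(library)
rel = relative_abundance(counts)

# Print the list sorted by decreasing abundance.
for value, n in counts.most_common():
    print(value, n, round(rel[value], 2))
```

Dividing by the library size is what makes abundance numbers from libraries of different sizes comparable.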

For example (to be described below in greater detail),
steps (a) and (b) can be repeated for two different
libraries (sometimes referred to as a "target" library and
a "subtractant" library). Then, for each identified
sequence value (or gene transcript), a "ratio" value is
obtained by dividing the abundance number (for that
identified sequence value) for the target library, by the
abundance number (for that identified sequence value) for
the subtractant library.

In fact, subtraction may be carried out on multiple
libraries. It is possible to add the transcripts from
several libraries (for example, three) and then to divide
them by another set of transcripts from multiple libraries
(again, for example, three). Notation for this operation
may be abbreviated as (A+B+C) / (D+E+F), where the capital
letters each indicate an entire library. Optionally the
abundance numbers of transcripts in the summed libraries
may be divided by the total sample size before subtraction.
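
The (A+B+C) / (D+E+F) operation can be illustrated as follows. This is a sketch under stated assumptions: each library is represented as a hypothetical dictionary mapping an identified sequence value to its abundance number, the function names are invented for the example, and only transcripts present in both pooled sets are reported, since the text at this point does not say how an absent transcript is handled.

```python
from collections import Counter


def pool(libraries):
    """Add the abundance numbers of several libraries together."""
    pooled = Counter()
    for lib in libraries:
        pooled.update(lib)  # Counter.update adds counts from a mapping
    return pooled


def pooled_ratios(numerator_libs, denominator_libs, normalize=False):
    """Compute (A+B+C)/(D+E+F) for every transcript found in both pools."""
    num, den = pool(numerator_libs), pool(denominator_libs)
    num_total, den_total = sum(num.values()), sum(den.values())
    ratios = {}
    for gene in num.keys() & den.keys():
        a, b = num[gene], den[gene]
        if normalize:
            # Optional: divide by total sample size before taking the ratio.
            a, b = a / num_total, b / den_total
        ratios[gene] = a / b
    return ratios


# Hypothetical abundance numbers for six libraries.
A, B, C = {"gene_1": 4, "gene_2": 1}, {"gene_1": 2}, {"gene_2": 1}
D, E, F = {"gene_1": 1, "gene_2": 2}, {"gene_1": 1}, {"gene_2": 2}

raw = pooled_ratios([A, B, C], [D, E, F])
scaled = pooled_ratios([A, B, C], [D, E, F], normalize=True)
```

Because the pooled abundance numbers are kept in memory, any other grouping of libraries can be compared by calling the same function again with different arguments.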

Unlike standard hybridization technology which permits
a single subtraction of two libraries, once one has
processed a set or library of transcript sequences and stored
them in the computer, any number of subtractions can be
performed on the library. For example, by this method,
ratio values can be obtained by dividing relative abundance
values in a first library by corresponding values in a
second library and vice versa.

In variations on step (a), the library consists of
nucleotide sequences derived from cDNA clones. Examples of
databases which can be searched for areas of homology
(similarity) in step (a) include the commercially available
databases known as Genbank (NIH), EMBL (European Molecular
Biology Labs, Germany), and GENESEQ (Intelligenetics,
Mountain View, California).

One homology search algorithm which can be used to
implement step (a) is the algorithm described in the paper
by D.J. Lipman and W.R. Pearson, entitled "Rapid and
Sensitive Protein Similarity Searches," Science 227:1435
(1985). In this algorithm, the homologous regions are
searched in a two-step manner. In the first step, the
highest homologous regions are determined by calculating a
matching score using a homology score table. The parameter
"Ktup" is used in this step to establish the minimum window
size to be shifted for comparing two sequences. Ktup also
sets the number of bases that must match to extract the
highest homologous region among the sequences. In this
step, no insertions or deletions are applied and the
homology is displayed as an initial (INIT) value.

In the second step, the homologous regions are aligned
to obtain the highest matching score by inserting a gap in
order to add a probable deleted portion. The matching
score obtained in the first step is recalculated using the
homology score table and the insertion score table to give an
optimized (OPT) value in the final output.
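
The role of Ktup in the first, ungapped step can be illustrated with a simplified sketch. It is not the published algorithm: it only counts exact k-tuple hits per alignment diagonal and does not apply a homology score table or compute INIT or OPT values. The function names and the default ktup=4 are assumptions made for the example.

```python
from collections import defaultdict


def ktup_diagonal_hits(query, subject, ktup=4):
    """Count exact k-tuple matches on each diagonal (subject pos - query pos)."""
    # Lookup table: every k-tuple of the query and the positions where it occurs.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Slide along the subject; each shared k-tuple votes for one diagonal.
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


def best_diagonal(query, subject, ktup=4):
    """Return (diagonal, hit count) for the densest diagonal, or None."""
    hits = ktup_diagonal_hits(query, subject, ktup)
    if not hits:
        return None
    return max(hits.items(), key=lambda kv: kv[1])


# The query occurs in the subject starting at offset 2.
print(best_diagonal("ACGTACGGT", "TTACGTACGGTCC"))  # (2, 6)
```

A larger ktup makes the lookup faster and more selective, at the cost of missing regions whose matches are interrupted by mismatches.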

DNA homologies between two sequences can be examined
graphically using the Harr method of constructing dot
matrix homology plots (Needleman, S.B. and Wunsch, C.D., J.
Mol. Biol. 48:443 (1970)). This method produces a
two-dimensional plot which can be useful in determining
regions of homology versus regions of repetition.
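
A dot matrix plot of this kind can be approximated in text form. The sketch below is illustrative only: the function name, the exact-match window comparison and the default window of 3 are assumptions, not details taken from the cited method.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Return text rows: '*' where the windows at (i, j) are identical."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# A sequence compared with itself: the main diagonal is the homology,
# and off-diagonal marks reveal an internal repeat.
for row in dot_matrix("ATAT", "ATAT", window=2):
    print(row)
```

In such a plot a single unbroken diagonal suggests a homologous region, whereas several parallel diagonals suggest repetition.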

However, in a class of preferred embodiments, step (a)
is implemented by processing the library data in the
commercially available computer program known as the
INHERIT 670 Sequence Analysis System, available from
Applied Biosystems Inc. (Foster City, California),
including the software known as the Factura software (also
available from Applied Biosystems Inc.). The Factura
program preprocesses each library sequence to "edit out"
portions thereof which are not likely to be of interest,
such as the vector used to prepare the library. Additional
sequences which can be edited out or masked (ignored by the
search tools) include but are not limited to the polyA tail
and repetitive GAG and CCC sequences. A low-end search
program can be written to mask out such "low-information"
sequences, or programs such as BLAST can ignore the
low-information sequences.
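
A small masking routine of the kind mentioned above might look like the following sketch. It is not the Factura program: the function name, the use of "N" as the mask character and the minimum run length of 10 are assumptions, and only an exact vector match and a 3' polyA tail are handled (repetitive GAG and CCC runs are left out for brevity).

```python
import re


def mask_low_information(seq, vector=None, min_run=10, mask_char="N"):
    """Mask an exact vector match and a 3' polyA tail, preserving length."""
    seq = seq.upper()
    if vector:
        v = vector.upper()
        seq = seq.replace(v, mask_char * len(v))
    # Mask a run of at least min_run A residues at the very end of the read.
    tail = re.compile(r"A{%d,}$" % min_run)
    return tail.sub(lambda m: mask_char * len(m.group()), seq)


read = "ACGTGGATCC" + "A" * 12
masked = mask_low_information(read, vector="GGATCC")
print(masked)
```

Masking rather than deleting keeps the coordinates of the remaining bases unchanged, which simplifies relating search hits back to the original read.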

In the algorithm implemented by the INHERIT 670
Sequence Analysis System, the Pattern Specification
Language (developed by TRW Inc.) is used to determine
regions of homology. "There are three parameters that
determine how INHERIT analysis runs sequence comparisons:
window size, window offset and error tolerance. Window
size specifies the length of the segments into which the
query sequence is subdivided. Window offset specifies
where to start the next segment [to be compared], counting
from the beginning of the previous segment. Error
tolerance specifies the total number of insertions,
deletions and/or substitutions that are tolerated over the
specified word length. Error tolerance may be set to any
integer between 0 and 6. The default settings are window
tolerance=20, window offset=10 and error tolerance=3."
INHERIT Analysis Users Manual, pp. 2-15, Version 1.0,
Applied Biosystems, Inc., October 1991.
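
How window size and window offset subdivide a query can be shown with a short sketch. It does not reproduce INHERIT or the Pattern Specification Language. The function names are hypothetical, the defaults assume that the quoted "window tolerance=20" refers to a window size of 20, and the error check counts substitutions only, ignoring insertions and deletions.

```python
def query_windows(seq, window_size=20, window_offset=10):
    """Split a query into overlapping segments as (start, segment) pairs."""
    last_start = len(seq) - window_size
    return [(start, seq[start:start + window_size])
            for start in range(0, last_start + 1, window_offset)]


def within_tolerance(segment, candidate, error_tolerance=3):
    """Simplified check: substitutions only, equal-length strings assumed."""
    mismatches = sum(a != b for a, b in zip(segment, candidate))
    return mismatches <= error_tolerance


query = "ACGT" * 11 + "A"  # 45 bases
windows = query_windows(query)
print([start for start, _ in windows])  # [0, 10, 20]
```

With an offset smaller than the window size, consecutive segments overlap, so a homologous region that straddles one segment boundary is still covered by the next segment.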

Using a combination of these three parameters, a
database (such as a DNA database) can be searched for
sequences containing regions of homology and the
appropriate sequences are scored with an initial value.
Subsequently, these homologous regions are examined using
dot matrix homology plots to determine regions of homology
versus regions of repetition. Smith-Waterman alignments
can be used to display the results of the homology search.
The INHERIT software can be executed by a Sun computer
system programmed with the UNIX operating system.




Search alternatives to INHERIT include the BLAST
program, GCG (available from the Genetics Computer Group,
WI) and the Dasher program (Temple Smith, Boston
University, Boston, MA). Nucleotide sequences can be
searched against Genbank, EMBL or custom databases such as
GENESEQ (available from Intelligenetics, Mountain View, CA)
or other databases for genes. In addition, we have
searched some sequences against our own in-house database.
In preferred embodiments, the transcript sequences are
analyzed by the INHERIT software for best conformance with
a reference gene transcript to assign a sequence identifier
and a degree of homology, which together are the
identified sequence value and are input into, and further
processed by, a Macintosh personal computer (available from
Apple) programmed with an "abundance sort and subtraction
analysis" computer program (to be described below).

Prior to the abundance sort and subtraction analysis
program (also denoted as the "abundance sort" program),
identified sequences from the cDNA clones are assigned
a value (according to the parameters given above) by degree
of match according to the following categories: "exact"
matches (regions with a high degree of identity),
homologous human matches (regions of high similarity, but
not "exact" matches), homologous non-human matches (regions
of high similarity present in species other than human), or
non matches (no significant regions of homology to
previously identified nucleotide sequences stored in the
form of the database). Alternately, the degree of match
can be a numeric value as described below.

With reference again to the step of identifying
matches between reference sequences and database entries,
protein and peptide sequences can be deduced from the
nucleic acid sequences. Using the deduced polypeptide
sequence, the match identification can be performed in a
manner analogous to that done with cDNA sequences. A
protein sequence is used as a query sequence and compared
to the previously identified sequences contained in a
database such as the Swiss/Prot, PIR and the NBRF Protein
database to find homologous proteins. These proteins are
initially scored for homology using a homology score table
(Orcutt, B.C. and Dayhoff, M.O., Scoring Matrices, PIR
Report MAT-0285 (February 1985)) resulting in an INIT
score. The homologous regions are aligned to obtain the
highest matching scores by inserting a gap which adds a
probable deleted portion. The matching score is
recalculated using the homology score table and the
insertion score table resulting in an optimized (OPT)
score. Even in the absence of knowledge of the proper
reading frame of an isolated sequence, the above-described
protein homology search may be performed by searching all 3
reading frames.
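
Deducing a protein query in each of the three reading frames can be sketched as below. The example assumes the standard genetic code and translates the given strand only. The function names are hypothetical, and unknown or ambiguous codons are rendered as "X" by choice rather than by any convention stated in this document.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def three_frames(seq):
    """Return the deduced peptide for reading frames 0, 1 and 2."""
    return [translate(seq[frame:]) for frame in range(3)]


print(three_frames("ATGGCCTAA"))  # ['MA*', 'WP', 'GL']
```

Each deduced peptide can then be used as a separate query against a protein database, and the frame giving the best match is taken as the likely reading frame.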

Peptide and protein sequence homologies can also be
ascertained using the INHERIT 670 Sequence Analysis System
in an analogous way to that used in DNA sequence
homologies. Pattern Specification Language and parameter
windows are used to search protein databases for sequences
containing regions of homology which are scored with an
initial value. Subsequent display in a dot-matrix homology
plot shows regions of homology versus regions of
repetition. Additional search tools that are available to
use on pattern search databases include PLsearch Blocks
(available from Henikoff & Henikoff, University of
Washington, Seattle), Dasher, and GCG. Pattern search
databases include, but are not limited to, Protein Blocks
(available from Henikoff & Henikoff, University of
Washington, Seattle), Brookhaven Protein (available from
the Brookhaven National Laboratory, Brookhaven, MA),
PROSITE (available from Amos Bairoch, University of Geneva,
Switzerland), ProDom (available from Temple Smith, Boston
University), and PROTEIN MOTIF FINGERPRINT (available from
University of Leeds, United Kingdom).

The ABI Assembler application software, part of the
INHERIT DNA analysis system (available from Applied
Biosystems, Inc., Foster City, CA), can be employed to
create and manage sequence assembly projects by assembling
data from selected sequence fragments into a larger
sequence. The Assembler software combines two advanced
computer technologies which maximize the ability to
assemble sequenced DNA fragments into Assemblages, a
special grouping of data where the relationships between
sequences are shown by graphic overlap, alignment and
statistical views. The process is based on the
Meyers-Kececioglu model of fragment assembly (INHERIT™
Assembler User's Manual, Applied Biosystems, Inc., Foster
City, CA), and uses graph theory as the foundation of a
very rigorous multiple sequence alignment engine for
assembling DNA sequence fragments. Other assembly programs
that can be used include MEGALIGN (available from DNASTAR
Inc., Madison, WI), Dasher and STADEN (available from Roger
Staden, Cambridge, England).

Next, with reference to Fig. 2, we describe in more
detail the "abundance sort" program which implements
above-mentioned "step (b)" to tabulate the number of sequences of
the library which match each database entry (the "abundance
number" for each database entry).

Fig. 2 is a flow chart of a preferred embodiment of
the abundance sort program. A source code listing of this
embodiment of the abundance sort program is set forth in
Table 5. In the Table 5 implementation, the abundance sort
program is written using the FoxBASE programming language
commercially available from Microsoft Corporation.
Although FoxBASE was the program chosen for the first
iteration of this technology, it should not be considered
limiting. Many other programming languages, Sybase being a
particularly desirable alternative, can also be used, as
will be obvious to one with ordinary skill in the art. The
subroutine names specified in Fig. 2 correspond to
subroutines listed in Table 5.

With reference again to Fig. 2, the "Identified
Sequences" are transcript sequences representing each
sequence of the library and a corresponding identification
of the database entry (if any) which it matches. In other
words, the "Identified Sequences" are transcript sequences
representing the output of above-discussed "step (a)."

Fig. 3 is a block diagram of a system for implementing
the invention. The Fig. 3 system includes library
generation unit 2 which generates a library and asserts an
output stream of transcript sequences indicative of the
biological sequences comprising the library. Programmed
processor 4 receives the data stream output from unit 2 and
processes this data in accordance with above-discussed
"step (a)" to generate the Identified Sequences. Processor
4 can be a processor programmed with the commercially
available computer program known as the INHERIT 670
Sequence Analysis System and the commercially available
computer program known as the Factura program (both
available from Applied Biosystems Inc.) and with the UNIX
operating system.

Still with reference to Fig. 3, the Identified
Sequences are loaded into processor 6 which is programmed
with the abundance sort program. Processor 6 generates the
Final Transcript sequences indicated in both Figs. 2 and 3.
Fig. 4 shows a more detailed block diagram of a planned
relational computer system, including various searching
techniques which can be implemented, along with an
assortment of databases to query against.

With reference to Fig. 2, the abundance sort program
first performs an operation known as "Tempnum" on the
Identified Sequences, to discard all of the Identified
Sequences except those which match database entries of
selected types. For example, the Tempnum process can
select Identified Sequences which represent matches of the
following types with database entries (see above for
definition): "exact" matches, human "homologous" matches,
"other species" matches (representing genes present in
species other than human), "no" matches (no significant
regions of homology with database entries representing
previously identified nucleotide sequences), "I" matches
(Incyte for not previously known DNA sequences), or "X"
matches (matches ESTs in reference database). This
eliminates the U, S, M, V, A, R and D sequences (see Table 1
for definitions).
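
The effect of this selection can be sketched as a simple filter. This is not the Table 5 "Tempnum" subroutine: the record layout and the field name "designation" are hypothetical, and the only codes used are the excluded letters named above (U, S, M, V, A, R and D), whose meanings are defined in Table 1.

```python
# Designation codes removed by the selection, as listed in the text.
EXCLUDED_DESIGNATIONS = frozenset("USMVARD")


def tempnum(identified_sequences):
    """Keep only records whose designation is not on the excluded list."""
    return [record for record in identified_sequences
            if record["designation"] not in EXCLUDED_DESIGNATIONS]


# Hypothetical identified sequences; only the designation matters here.
records = [
    {"clone": "c001", "designation": "X"},
    {"clone": "c002", "designation": "U"},
    {"clone": "c003", "designation": "I"},
    {"clone": "c004", "designation": "R"},
]

kept = tempnum(records)
print([r["clone"] for r in kept])  # ['c001', 'c003']
```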

The identified sequence values selected during the
"Tempnum" process then undergo a further selection (weeding
out) operation known as "Tempred." This operation can, for
example, discard all identified sequence values
representing matches with selected database entries.

The identified sequence values selected during the
"Tempred" process are then classified according to library,
during the "Tempdesig" operation. It is contemplated that
the "Identified Sequences" can represent sequences from a
single library, or from two or more libraries.

Consider first the case that the identified sequence
values represent sequences from a single library. In this
case, all the identified sequence values determined during
"Tempred" undergo sorting in the "Templib" operation,
further sorting in the "Libsort" operation, and finally
additional sorting in the "Temptarsort" operation. For
example, these three sorting operations can sort the
identified sequences in order of decreasing "abundance
number" (to generate a list of decreasing abundance
numbers, each abundance number corresponding to a unique
identified sequence entry, or several lists of decreasing
abundance numbers, with the abundance numbers in each list
corresponding to database entries of a selected type) with
redundancies eliminated from each sorted list. In this
case, the operation identified as "Cruncher" can be
bypassed, so that the "Final Data" values are the organized
transcript sequences produced during the "Temptarsort"
operation.

We next consider the case that the transcript
sequences produced during the "Tempred" operation represent
sequences from two libraries (which we will denote the
"target" library and the "subtractant" library). For
example, the target library may consist of cDNA sequences
from clones of a diseased cell, while the subtractant
library may consist of cDNA sequences from clones of the
diseased cell after treatment by exposure to a drug. For
another example, the target library may consist of cDNA
sequences from clones of a cell type from a young human,
while the subtractant library may consist of cDNA sequences
from clones of the same cell type from the same human at
different ages.






In this case, the "Tempdesig" operation routes all
transcript sequences representing the target library for
processing in accordance with "Templib" (and then "Libsort"
and "Temptarsort"), and routes all transcript sequences
representing the subtractant library for processing in
accordance with "Tempsub" (and then "Subsort" and
"Tempsubsort"). For example, the consecutive "Templib,"
"Libsort," and "Temptarsort" sorting operations sort
identified sequences from the target library in order of
decreasing abundance number (to generate a list of
decreasing abundance numbers, each abundance number
corresponding to a database entry, or several lists of
decreasing abundance numbers, with the abundance numbers in
each list corresponding to database entries of a selected
type) with redundancies eliminated from each sorted list.
The consecutive "Tempsub," "Subsort," and "Tempsubsort"
sorting operations sort identified sequences from the
subtractant library in order of decreasing abundance number
(to generate a list of decreasing abundance numbers, each
abundance number corresponding to a database entry, or
several lists of decreasing abundance numbers, with the
abundance numbers in each list corresponding to database
entries of a selected type) with redundancies eliminated
from each sorted list.

The transcript sequences output from the "Temptarsort"
operation typically represent sorted lists from which a
histogram could be generated in which position along one
(e.g., horizontal) axis indicates abundance number (of
target library sequences), and position along another
(e.g., vertical) axis indicates identified sequence value
(e.g., human or non-human gene type). Similarly, the
transcript sequences output from the "Tempsubsort"
operation typically represent sorted lists from which a
histogram could be generated in which position along one
(e.g., horizontal) axis indicates abundance number (of
subtractant library sequences), and position along another
(e.g., vertical) axis indicates identified sequence value
(e.g., human or non-human gene type).




The transcript sequences (sorted lists) output from
the Tempsubsort and Temptarsort sorting operations are
combined during the operation identified as "Cruncher."
The "Cruncher" process identifies pairs of corresponding
target and subtractant abundance numbers (both representing
the same identified sequence value), and divides one by the
other to generate a "ratio" value for each pair of
corresponding abundance numbers, and then sorts the ratio
values in order of decreasing ratio value. The data output
from the "Cruncher" operation (the Final Transcript
sequence in Fig. 2) is typically a sorted list from which a
histogram could be generated in which position along one
axis indicates the size of a ratio of abundance numbers
(for corresponding identified sequence values from target
and subtractant libraries) and position along another axis
indicates identified sequence value (e.g., gene type).

Preferably, prior to obtaining a ratio between the two
library abundance values, the Cruncher operation also
divides each abundance value by the total number of sequences
in one or both of the target and subtractant libraries.
The resulting lists of "relative" ratio values generated by
the Cruncher operation are useful for many medical,
scientific, and industrial applications. Also preferably,
the output of the Cruncher operation is a set of lists,
each list representing a sequence of decreasing ratio
values for a different selected subset (e.g. protein
family) of database entries.

In one example, the abundance sort program of the
invention tabulates for a library the numbers of mRNA
transcripts corresponding to each gene identified in a
database. These numbers are divided by the total number of
clones sampled. The results of the division reflect the
relative abundance of the mRNA transcripts in the cell type
or tissue from which they were obtained. Obtaining this
final data set is referred to herein as "gene transcript
image analysis." The resulting subtracted data show
exactly what proteins and genes are upregulated and
downregulated in highly detailed complexity.




6.6. HUVEC cDNA LIBRARY

Table 2 is an abundance table listing the various gene
transcripts in an induced HUVEC library. The transcripts
are listed in order of decreasing abundance. This
computerized sorting simplifies analysis of the tissue and
speeds identification of significant new proteins which are
specific to this cell type. This type of endothelial cell
lines tissues of the cardiovascular system, and the more
that is known about its composition, particularly in
response to activation, the more choices of protein targets
become available to affect in treating disorders of this
tissue, such as the highly prevalent atherosclerosis.

6.7. MONOCYTE-CELL AND MAST-CELL cDNA LIBRARIES 

Tables 3 and 4 show truncated comparisons of two 

15 libraries. In Tables 3 and 4 the "normal monocytes" are 
the HMC-1 cells, and the "activated macrophages" are the 
THP-1 cells pretreated with PMA and activated with LPS. 
Table 3 lists in descending order of abundance the most 
abundant gene transcripts for both cell types. With only 

20 15 gene transcripts from each cell type, this table permits 
quick, qualitative comparison of the most common 
transcripts. This abundance sort, with its convenient 
side-by-side display, provides an immediately useful 
research tool. In this example, this research tool 

25 discloses that 1) only one of the top 15 activated 
macrophage transcripts is found in the top 15 normal 
monocyte gene transcripts (poly A binding protein); and 2) 
a new gene transcript (previously unreported in other 
databases) is relatively highly represented in activated 

30 macrophages but is not similarly prominent in normal 

macrophages. Such a research tool provides researchers 
with a short-cut to new proteins, such as receptors, cell- 
surface and intracellular signalling molecules, which can 
serve as drug targets in commercial drug screening 

35 programs. Such a tool could save considerable time over 
that consumed by a hit and miss discovery program aimed at 
identifying important proteins in and around cells, because 
those proteins carrying out everyday cellular functions and 

34 



wo 95/20681 PCT/DS95/01160 

represented as steady state mRNA are quickly eliminated 
from further characterization. 
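
The side-by-side abundance sort described above can be 
sketched as follows. This is a minimal illustration, not the 
program of Table 5: the function names and the toy clone 
lists are hypothetical, and each list entry is assumed to be 
the gene to which one sequenced clone was matched. 

```python
from collections import Counter


def top_transcripts(clone_matches, n=15):
    """Return the n most abundant (gene, count) pairs, most abundant first."""
    counts = Counter(clone_matches)
    # Descending count; ties broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


def side_by_side(library_a, library_b, n=15):
    """Return both top-n lists plus the genes that appear in both lists."""
    top_a = top_transcripts(library_a, n)
    top_b = top_transcripts(library_b, n)
    shared = {gene for gene, _ in top_a} & {gene for gene, _ in top_b}
    return top_a, top_b, shared


# Hypothetical toy data: one list entry per sequenced clone.
normal = ["poly A bp"] * 4 + ["actin"] * 3 + ["ferritin"] * 2
activated = ["IL-1 beta"] * 5 + ["poly A bp"] * 3 + ["IL-8"] * 2
top_normal, top_activated, shared = side_by_side(normal, activated, n=3)
```

In this toy case the only transcript shared by the two top 
lists is "poly A bp"; every other entry is specific to one 
library's list. 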

This illustrates how the gene transcript profiles 
change with altered cellular function. Those skilled in 
the art know that the biochemical composition of cells also 
changes with other functional changes such as cancer, 
including cancer's various stages, and exposure to 
toxicity. A gene transcript subtraction profile such as in 
Table 3 is useful as a first screening tool for such gene 
expression and protein studies. 

6.8. SUBTRACTION ANALYSIS OF NORMAL MONOCYTE-CELL AND 
ACTIVATED MONOCYTE CELL cDNA LIBRARIES 

Once the cDNA data are in the computer, the computer 
program as disclosed in Table 5 was used to obtain ratios 
of all the gene transcripts in the two libraries discussed 
in Example 6.7, and the gene transcripts were sorted by the 
descending values of their ratios. If a gene transcript is 
not represented in one library, that gene transcript's 
abundance is unknown but appears to be less than 1. As an 
approximation (and to obtain a ratio, which would not be 
possible if the unrepresented gene were given an abundance 
of zero), genes which are represented in only one of the 
two libraries are assigned an abundance of 1/2. Using 1/2 
for unrepresented clones increases the relative importance 
of "turned-on" and "turned-off" genes, whose products would 
be drug candidates. The resulting print-out is called a 
subtraction table and is an extremely valuable screening 
method, as is shown by the following data. 
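
The ratio calculation described above, including the 
abundance of 1/2 assigned to an unrepresented gene, can be 
sketched as follows. This is an illustrative sketch and not 
the FoxBASE program of Table 5. The function name and the 
"hypothetical gene" entry are assumptions; the three other 
counts are taken from the bgfreq and rfend columns of Table 4. 

```python
def subtraction_table(target, background, floor=0.5):
    """Return (gene, bgfreq, rfend, ratio) rows, highest ratio first.

    target and background map gene -> clone count. A gene absent from
    one library is given the abundance `floor` (1/2) so that a ratio
    can still be formed.
    """
    rows = []
    for gene in set(target) | set(background):
        rfend = target.get(gene, 0)
        bgfreq = background.get(gene, 0)
        numerator = rfend if rfend > 0 else floor
        denominator = bgfreq if bgfreq > 0 else floor
        rows.append((gene, bgfreq, rfend, numerator / denominator))
    # Descending ratio; ties broken by gene name for a stable listing.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows


# Counts for the first three genes are those shown in Table 4.
activated = {"IL-1 beta": 131, "MIP-1": 121, "Adenylate cyclase": 10}
normal = {"MIP-1": 3, "Adenylate cyclase": 1, "hypothetical gene": 4}
table = subtraction_table(activated, normal)
```

A "turned-on" gene such as IL-1 beta (0 clones in the 
background, 131 in the target) receives a ratio of 
131 / 0.5 = 262, while a gene seen only in the background 
falls to the bottom of the listing with a ratio below 1. 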

Table 4 is a subtraction table, in which the normal 
monocyte library was electronically "subtracted" from the 
activated macrophage library. This table highlights most 
effectively the changes in abundance of the gene 
transcripts by activation of macrophages. Even among the 
first 20 gene transcripts listed, there are several unknown 
gene transcripts. Thus, electronic subtraction is a useful 
tool with which to assist researchers in identifying much 
more quickly the basic biochemical changes between two cell 
types. Such a tool can save universities and 
pharmaceutical companies which spend billions of dollars on 
research valuable time and laboratory resources at the 
early discovery stage and can speed up the drug development 
cycle, which in turn permits researchers to set up drug 
screening programs much earlier. Thus, this research tool 
provides a way to get new drugs to the public faster and 
more economically. 

Also, such a subtraction table can be obtained for 
patient diagnosis. An individual patient sample (such as 
monocytes obtained from a biopsy or blood sample) can be 
compared with data provided herein to diagnose conditions 
associated with macrophage activation. 

Table 4 uncovered many new gene transcripts (labeled 
Incyte clones). Note that many genes are turned on in the 
activated macrophage (i.e., the monocyte had a 0 in the 
bgfreq column). This screening method is superior to other 
screening techniques, such as the western blot, which are 
incapable of uncovering such a multitude of discrete new 
gene transcripts. 

The subtraction-screening technique has also uncovered 
a high number of cancer gene transcripts (oncogenes rho, 
ETS2, rab-2 ras, YPT1-related, and acute myeloid leukemia 
mRNA) in the activated macrophage. These transcripts may 
be attributed to the use of immortalized cell lines and are 
inherently interesting for that reason. This screening 
technique offers a detailed picture of upregulated 
transcripts including oncogenes, which helps explain why 
anti-cancer drugs interfere with the patient's immunity 
mediated by activated macrophages. Armed with knowledge 
gained from this screening method, those skilled in the art 
can set up more targeted, more effective drug screening 
programs to identify drugs which are differentially 
effective against 1) both relevant cancers and activated 
macrophage conditions with the same gene transcript 
profile; 2) cancer alone; and 3) activated macrophage 
conditions. 

Smooth muscle senescent protein (22 kd) was 
upregulated in the activated macrophage, which indicates 
that it is a candidate to block in controlling 
inflammation. 

6.9. SUBTRACTION ANALYSIS OF NORMAL LIVER CELLS AND 
HEPATITIS INFECTED LIVER CELL cDNA LIBRARIES 

In this example, rats are exposed to hepatitis virus 
and maintained in the colony until they show definite signs 
of hepatitis. Of the rats diagnosed with hepatitis, one 
half of the rats are treated with a new anti-hepatitis 
agent (AHA). Liver samples are obtained from all rats 
before exposure to the hepatitis virus and at the end of 
AHA treatment or no treatment. In addition, liver samples 
can be obtained from rats with hepatitis just prior to AHA 
treatment. 

The liver tissue is treated as described in Examples 
6.2 and 6.3 to obtain mRNA and subsequently to sequence 
cDNA. The cDNA from each sample are processed and analyzed 
for abundance according to the computer program in Table 5. 
The resulting gene transcript images of the cDNA provide 
detailed pictures of the baseline (control) for each animal 
and of the infected and/or treated state of the animals. 
cDNA data for a group of samples can be combined into a 
group summary gene transcript profile for all control 
samples, all samples from infected rats and all samples 
from AHA-treated rats. 

Subtractions are performed between appropriate 
individual libraries and the grouped libraries. For 
individual animals, control and post-study samples can be 
subtracted. Also, if samples are obtained before and after 
AHA treatment, that data from individual animals and 
treatment groups can be subtracted. In addition, the data 
for all control samples can be pooled and averaged. The 
control average can be subtracted from averages of both 
post-study AHA and post-study non-AHA cDNA samples. If 
pre- and post-treatment samples are available, pre- and 
post-treatment samples can be compared individually (or 
electronically averaged) and subtracted. 
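
Pooling and averaging a group of libraries before 
subtraction can be sketched as follows. This is a minimal 
sketch under stated assumptions: the libraries in a group 
are assumed to be of comparable sequencing depth, so raw 
clone counts are averaged directly, and averages below 1/2 
are clamped to 1/2 before the ratio is taken. The function 
names, gene names and counts are all hypothetical. 

```python
def average_counts(libraries):
    """Average the per-gene clone counts over a group of libraries."""
    genes = set().union(*libraries)
    n = len(libraries)
    return {g: sum(lib.get(g, 0) for lib in libraries) / n for g in genes}


def group_ratios(treated_avg, control_avg, floor=0.5):
    """Ratio of treated to control averages, clamping small values to floor."""
    ratios = {}
    for gene in set(treated_avg) | set(control_avg):
        treated = max(treated_avg.get(gene, 0), floor)
        control = max(control_avg.get(gene, 0), floor)
        ratios[gene] = treated / control
    return ratios


# Hypothetical counts for two control and two treated animals.
controls = [{"albumin": 10, "marker": 1}, {"albumin": 12, "marker": 1}]
treated = [{"albumin": 11, "marker": 6}, {"albumin": 11, "marker": 8}]

control_avg = average_counts(controls)
treated_avg = average_counts(treated)
ratios = group_ratios(treated_avg, control_avg)
```

Here the unchanged transcript has a ratio of 1, and the 
hypothetical marker rises sevenfold in the treated group. 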

These subtraction tables are used in two general ways. 
First, the differences are analyzed for gene transcripts 
which are associated with continuing hepatic deterioration 
or healing. The subtraction tables are tools to isolate 
the effects of the drug treatment from the underlying basic 
pathology of hepatitis. Because hepatitis affects many 
parameters, additional liver toxicity has been difficult to 
detect with only blood tests for the usual enzymes. The 
gene transcript profile and subtraction provides a much 
more complex biochemical picture which researchers have 
needed to analyze such difficult problems. 

Second, the subtraction tables provide a tool for 
identifying clinical markers, individual proteins or other 
biochemical determinants which are used to predict and/or 
evaluate a clinical endpoint, such as disease, improvement 
due to the drug, and even additional pathology due to the 
drug. The subtraction tables specifically highlight genes 
which are turned on or off. Thus, the subtraction tables 
provide a first screen for a set of gene transcript 
candidates for use as clinical markers. Subsequently, 
electronic subtractions of additional cell and tissue 
libraries reveal which of the potential markers are in fact 
found in different cell and tissue libraries. Candidate 
gene transcripts found in additional libraries are removed 
from the set of potential clinical markers. Then, tests of 
blood or other relevant samples which are known to lack and 
to have the relevant condition are compared to validate the 
selection of the clinical marker. In this method, the 
particular physiologic function of the protein transcript 
need not be determined to qualify the gene transcript as a 
clinical marker. 
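
The two-stage marker screen described above (select the 
turned-on and turned-off transcripts, then remove those 
also found in other cell and tissue libraries) can be 
sketched as follows. The function name, the fold-change 
cutoff `min_fold`, and the example data are hypothetical; 
the text does not specify a numerical cutoff. 

```python
def candidate_markers(ratios, other_libraries, min_fold=4.0):
    """Return candidate marker genes, sorted by name.

    ratios maps gene -> subtraction ratio; other_libraries is a list of
    gene -> count mappings for additional cell and tissue libraries.
    """
    # First screen: strongly turned-on or turned-off transcripts.
    changed = {
        gene for gene, ratio in ratios.items()
        if ratio >= min_fold or ratio <= 1.0 / min_fold
    }
    # Second screen: drop candidates that appear in any other library.
    seen_elsewhere = set().union(*other_libraries)
    return sorted(changed - seen_elsewhere)


# Hypothetical subtraction ratios and additional libraries.
ratios = {"gene A": 12.0, "gene B": 0.1, "gene C": 1.2, "gene D": 8.0}
other_libraries = [{"gene D": 5, "gene C": 9}, {"gene E": 2}]
markers = candidate_markers(ratios, other_libraries)
```

In this example "gene D" changes strongly but is discarded 
because it also appears in another library, which leaves 
"gene A" and "gene B" for validation against samples. 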

6.10. ELECTRONIC NORTHERN BLOT 

One limitation of electronic subtraction is that it is 
difficult to compare more than a pair of images at once. 
Once particular individual gene products are identified as 
relevant to further study (via electronic subtraction or 
other methods), it is useful to study the expression of 
single genes in a multitude of different tissues. In the 
lab, the technique of "Northern" blot hybridization is used 
for this purpose. In this technique, a single cDNA, or a 
probe corresponding thereto, is labeled and then hybridized 
against a blot containing RNA samples prepared from a 
multitude of tissues or cell types. Upon autoradiography, 



second set is indicative of one of the biological sequences 
of the second library. Then the second set of transcript 
sequences is processed in a programmed computer to generate 
a second set of identified sequence values, namely the 
further identified sequence values, each of which is 
indicative of a sequence annotation and includes a degree 
of match between one of the biological sequences of the 
second library and at least one of the reference sequences. 
The further identified sequence values are processed to 
generate further final data values indicative of the number 
of times each further identified sequence value is present 
in the second library. The final data values from the 
first specimen and the further identified sequence values 
from the second specimen are processed to generate ratios 
of transcript sequences, which indicate the differences in 
the number of gene transcripts between the two specimens. 

In a further embodiment, the method includes 
quantifying the relative abundance of mRNA in a biological 
specimen by (a) isolating a population of mRNA transcripts 
from a biological specimen; (b) identifying genes from 
which the mRNA was transcribed by a sequence-specific 
method; (c) determining the numbers of mRNA transcripts 
corresponding to each of the genes; and (d) using the mRNA 
transcript numbers to determine the relative abundance of 
mRNA transcripts within the population of mRNA transcripts. 

Also disclosed is a method of producing a gene 
transcript image analysis by first obtaining a mixture of 
mRNA, from which cDNA copies are made. The cDNA is 
inserted into a suitable vector which is used to transfect 
suitable host strain cells which are plated out and 
permitted to grow into clones, each clone representing a 
unique mRNA. A representative population of clones 
transfected with cDNA is isolated. Each clone in the 
population is identified by a sequence-specific method 
which identifies the gene from which the unique mRNA was 
transcribed. The number of times each gene is identified 
to a clone is determined to evaluate gene transcript 
abundance. The genes and their abundances are listed in 
order of abundance to produce a gene transcript image. 
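
The final counting and listing steps of the method above 
can be sketched as follows. The sketch assumes that each 
clone has already been identified to a gene by sequence 
comparison; the function name, clone numbers and gene 
assignments are hypothetical. 

```python
from collections import Counter


def transcript_image(clone_to_gene):
    """Return (gene, count, fraction) rows in descending order of abundance.

    clone_to_gene maps a clone number to the gene it was identified as.
    """
    counts = Counter(clone_to_gene.values())
    total = sum(counts.values())
    # Descending count; ties broken by gene name for a stable listing.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(gene, n, n / total) for gene, n in ranked]


# Hypothetical identifications for eight clones.
clones = {
    1: "fibronectin", 2: "actin", 3: "fibronectin", 4: "fibronectin",
    5: "actin", 6: "collagenase", 7: "fibronectin", 8: "actin",
}
image = transcript_image(clones)
```

Each row carries both the raw clone count and the fraction 
of the analyzed clones, so the same listing serves as the 
abundance sort and as a relative-abundance estimate. 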

the pattern of expression of that particular gene, one at a 
time, can be quantitated in all the included samples. 

In contrast, a further embodiment of this invention is 
the computerized form of this process, termed here 
"electronic northern blot." In this variation, a single 
gene is queried for expression against a multitude of 
prepared and sequenced libraries present within the 
database. In this way, the pattern of expression of any 
single candidate gene can be examined instantaneously and 
effortlessly. More candidate genes can thus be scanned, 
leading to more frequent and fruitfully relevant 
discoveries. The computer program included as Table 5 
includes a program for performing this function, and Table 
6 is a partial listing of entries of the database used in 
the electronic northern blot analysis. 
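
The query underlying an electronic northern blot can be 
sketched as follows. This is an illustration only and not 
the routine included in Table 5: the function name is 
hypothetical, the database is modelled as a mapping from 
library name to per-gene clone counts, and the counts shown 
are invented for the example. 

```python
def electronic_northern(gene, libraries):
    """Report one gene's abundance in every library in the database.

    libraries maps library name -> {gene: clone count}. Each returned
    row is (library, count, total clones, fraction of clones).
    """
    rows = []
    for name, counts in libraries.items():
        total = sum(counts.values())
        n = counts.get(gene, 0)
        fraction = n / total if total else 0.0
        rows.append((name, n, total, fraction))
    # Highest expression first; ties broken by library name.
    return sorted(rows, key=lambda row: (-row[3], row[0]))


# Hypothetical counts; only the library names come from the text.
libraries = {
    "HUVEC": {"endothelin": 9, "actin": 31},
    "THP-1": {"IL-8": 119, "actin": 1},
    "HMC-1": {"actin": 20},
}
blot = electronic_northern("endothelin", libraries)
```

Expressing each count as a fraction of the clones analyzed 
keeps libraries of different sizes comparable in the same 
listing. 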

6.11. PHASE I CLINICAL TRIALS 

Based on the establishment of safety and effectiveness 
in the above animal tests, Phase I clinical tests are 
undertaken. Normal patients are subjected to the usual 
preliminary clinical laboratory tests. In addition, 
appropriate specimens are taken and subjected to gene 
transcript analysis. Additional patient specimens are 
taken at predetermined intervals during the test. The 
specimens are subjected to gene transcript analysis as 
described above. In addition, the gene transcript changes 
noted in the earlier rat toxicity study are carefully 
evaluated as clinical markers in the followed patients. 
Changes in the gene transcript analyses are evaluated as 
indicators of toxicity by correlation with clinical signs 
and symptoms and other laboratory results. In addition, 
subtraction is performed on individual patient specimens 
and on averaged patient specimens. The subtraction 
analysis highlights any toxicological changes in the 
treated patients. This is a highly refined determinant of 
toxicity. The subtraction method also annotates clinical 
markers. Further subgroups can be analyzed by subtraction 
analysis, including, for example, 1) segregation by 
occurrence and type of adverse effect; and 2) segregation 
by dosage. 

6.12. GENE TRANSCRIPT IMAGING ANALYSIS IN CLINICAL STUDIES 

A gene transcript imaging analysis (or multiple gene 
transcript imaging analyses) is a useful tool in other 
clinical studies. For example, the differences in gene 
transcript imaging analyses before and after treatment can 
be assessed for patients on placebo and drug treatment. 
This method also effectively screens for clinical markers 
to follow in clinical use of the drug. 

6.13. COMPARATIVE GENE TRANSCRIPT ANALYSIS BETWEEN SPECIES 

The subtraction method can be used to screen cDNA 
libraries from diverse sources. For example, the same cell 
types from different species can be compared by gene 
transcript analysis to screen for specific differences, 
such as in detoxification enzyme systems. Such testing 
aids in the selection and validation of an animal model for 
the commercial purpose of drug screening or toxicological 
testing of drugs intended for human or animal use. When 
the comparison between animals of different species is 
shown in columns for each species, we refer to this as an 
interspecies comparison, or zoo blot. 
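
The column-per-species layout of such an interspecies 
comparison can be sketched as follows. The function name, 
the enzyme names and the counts are hypothetical, and the 
sketch assumes that transcripts from the different species 
have already been mapped to a common gene name. 

```python
def zoo_blot(profiles):
    """Arrange per-species gene counts into one column per species.

    profiles maps species -> {gene: clone count}. Returns a header row
    and one data row per gene, with 0 where a gene was not observed.
    """
    species = sorted(profiles)
    genes = sorted(set().union(*profiles.values()))
    header = ["gene"] + species
    rows = [[g] + [profiles[s].get(g, 0) for s in species] for g in genes]
    return header, rows


# Hypothetical detoxification-enzyme counts in the same cell type.
profiles = {
    "human": {"CYP1A1": 4, "GST": 7},
    "rat": {"GST": 5, "CYP2B1": 3},
}
header, rows = zoo_blot(profiles)
```

Genes present in only one column stand out immediately, 
which is the kind of species difference that bears on the 
choice of an animal model. 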

Embodiments of this invention may employ databases 
such as those written using the FoxBASE programming 
language commercially available from Microsoft Corporation. 
Other embodiments of the invention employ other databases, 
such as a random peptide database, a polymer database, a 
synthetic oligomer database, or an oligonucleotide database 
of the type described in U.S. Patent 5,270,170, issued 
December 14, 1993 to Cull, et al., PCT International 
Application Publication No. WO 9322684, published November 
11, 1993, PCT International Application Publication No. WO 
9306121, published April 1, 1993, or PCT International 
Application Publication No. WO 9119818, published December 
26, 1991. These four references (whose text is 
incorporated herein by reference) include teaching which 
may be applied in implementing such other embodiments of 
the present invention. 

All references referred to in the preceding text are 
hereby expressly incorporated by reference herein. 

Various modifications and variations of the described 
method and system of the invention will be apparent to 
those skilled in the art without departing from the scope 
and spirit of the invention. Although the invention has 
been described in connection with specific preferred 
embodiments, it should be understood that the invention as 
claimed should not be unduly limited to such specific 
embodiments. 




TABLE 2 

Clone numbers 15000 through 20000 
Libraries: HUVEC 
Arranged by ABUNDANCE 
Total clones analyzed: 5000 
319 genes, for a total of 1713 clones 

     number    N   c  entry      s  descriptor
  1   15365   67      HSRPL41       Riboptn L41
  2   15004   65      NCY015004     INCYTE 015004
  3   15638   63      NCY015638     INCYTE 015638
  4   15390   50      NCY015390     INCYTE 015390
  5   15193   47      HSFIB1        Fibronectin
  6   15220   47      RRRPL9     R  Riboptn L9
  7   15280   47      NCY015280     INCYTE 015280
  8   15583   33      M62060        EST HHCH09 (IGR)
  9   15662   31      HSACTCGR      Actin, gamma
 10   15026   29      NCY015026     INCYTE 015026
 11   15279   24      HSEF1AR       Elf 1-alpha
 12   15027   23      NCY015027     INCYTE 015027
 13   15033   20      NCY015033     INCYTE 015033
 14   15198   20      NCY015198     INCYTE 015198
 15   15809   20      HSCOLL1       Collagenase
 16   15221   19      NCY015221     INCYTE 015221
 17   15263   19      NCY015263     INCYTE 015263
 18   15290   19      NCY015290     INCYTE 015290
 19   15350   18      NCY015350     INCYTE 015350
 20   15030   17      NCY015030     INCYTE 015030
 21   15234   17      NCY015234     INCYTE 015234
 22   15459   16      NCY015459     INCYTE 015459
 23   15353   15      NCY015353     INCYTE 015353
 24   15378   15      S76965        Ptn kinase inhib
 25   15255   14      HUMTHYB4      Thymosin beta-4
 26   15401   14      HSLIPCR       Lipocortin I
 27   15425   14      HSPOLYAB      Poly-A bp
 28   18212   14      HUMTHYMA      Thymosin, alpha
 29   18216   14      HSMRP1        Motility relat ptn; MRP-1;CD
 30   15189   13      HS18D         Interferon induc ptn 1-8D
 31   15031   12      HUMFKBP       FK506 bp
 32   15306   12      HSH2AZ        Histone H2A
 33   15621   12      HUMLEC        Lectin, B-galbp, 14kDa
 34   15789   11      NCY015789     INCYTE 015789
 35   16578   11      HSRPS11       Riboptn S11
 36   16632   11      M61984        EST HHCA13 (IGR)
 37   18314   11      NCY018314     INCYTE 018314
 38   15367   10      NCY015367     INCYTE 015367
 39   15415   10      HSIFNIN1      Interferon induc mRNA
 40   15633   10      HSLDHAR       Lactate dehydrogenase
 41   15813   10      CHKNMHCB   C  Myosin heavy chain B
 42   18210   10      NCY018210     INCYTE 018210
 43   18233   10      HSRPII140     RNA polymerase II
 44   18996   10      NCY018996     INCYTE 018996
 45   15088    9      HUMFERL       Ferritin, light chain
 46   15714    9      NCY015714     INCYTE 015714
 47   15720    9      NCY015720     INCYTE 015720
 48   15863    9      NCY015863     INCYTE 015863
 49   16121    9      HSET          Endothelin
 50   18252    9      NCY018252     INCYTE 018252
 51   15351    8      HUMALBP       Lipid bp, adipocyte
 52   15370    8      NCY015370     INCYTE 015370
 53   15670    8      BTCIASHI   V  NADH-ubiq oxidoreductase
 54   15795    8      NCY015795     INCYTE 015795
 55   16245    8      NCY016245     INCYTE 016245
 56   18262    8      NCY018262     INCYTE 018262
 57   18321    8      HSRPL17       Riboptn L17
 58   15126    7      XLRPL1BRF     Riboptn L1
 59   15133    7      HSAC07        Actin, beta
 60   15245    7      NCY015245     INCYTE 015245
 61   15288    7      NCY015288     INCYTE 015288
 62   15294    7      HSGAPDR       G-3-PD
 63   15442    7      HUMLAMB       Laminin receptor, 54kDa
 64   15485    7      HSNGMRNA      Uracil DNA glycosylase
 65   16646    7      NCY016646     INCYTE 016646
 66   18003    7      HUMPAIA       Plsmnogen activ gene
 67   15032    6      HUMUB         Ubiquitin
 68   15267    6      HSRPS8        Riboptn S8
 69   15295    6      NCY015295     INCYTE 015295
 70   15458    6      RNRPS10R   R  Riboptn S10
 71   15832    6      RSGALEM    R  UDP-galactose epimerase
 72   15928    6      HUMAPOJ       Apolipoptn J
 73   16598    6      HUMTBBM40     Tubulin, beta
 74   18218    6      NCY018218     INCYTE 018218
 75   18499    6      HSP27         Hydrophobic ptn p27
 76   18963    6      NCY018963     INCYTE 018963
 77   18997    6      NCY018997     INCYTE 018997
 78   15432    5      HSAGALAR      Galactosidase A, alpha
 79   15475    5      NCY015475     INCYTE 015475
 80   15721    5      NCY015721     INCYTE 015721
 81   15865    5      NCY015865     INCYTE 015865
 82   16270    5      NCY016270     INCYTE 016270
 83   16886    5      NCY016886     INCYTE 016886
 84   18500    5      NCY018500     INCYTE 018500
 85   18503    5      NCY018503     INCYTE 018503
 86   19672    5      RRRPL34    R  Riboptn L34
 87   15086    4      XLRPL1AR   F  Riboptn L1a
 88   15113    4      HUMIFNWRS     tRNA synthetase, trp
 89   15242    4      NCY015242     INCYTE 015242
 90   15249    4      NCY015249     INCYTE 015249
 91   15377    4      NCY015377     INCYTE 015377
 92   15407    4      NCY015407     INCYTE 015407
 93   15473    4      NCY015473     INCYTE 015473
 94   15588    4      HSRPS12       Riboptn S12
 95   15684    4      HSEF1G        Elf 1-gamma
 96   15782    4      NCY015782     INCYTE 015782
 97   15916    4      HSRPS18       Riboptn S18
 98   15930    4      NCY015930     INCYTE 015930
 99   16108    4      NCY016108     INCYTE 016108
100   16133    4      NCY016133     INCYTE 016133






TABLE 4 

Libraries: THP-1 
Subtracting: HMC 
Sorted by ABUNDANCE 
Total clones analyzed: 7375 
1057 genes, for a total of 2151 clones 

number  entry      s  descriptor                    bgfreq  rfend    ratio
 10022  HUMIL1        IL 1-beta                          0    131   262.00
 10036  HSMDNCF       IL-8                               0    119   238.00
 10089  HSLAG1CDN     Lymphocyte activ gene              0     71   142.00
 10060  HUMTCSM       RANTES                             0     23   46.000
 10003  HUMMIP1A      MIP-1                              3    121   40.333
 10689  HSOP          Osteopontin                        0     20   40.000
 11050  NCY011050     INCYTE 011050                      0     17   34.000
 10937  HSTNFR        TNF-alpha                          0     17   34.000
 10176  HSSOD         Superoxide dismutase               0     14   28.000
 10886  HSCDW40       B-cell activ, NGF-relat            0
 10186  HUMAPR        Early resp PMA-induc               0
 10967  HUMGDN        PN-1, glial-deriv                  0      9   18.000
 11353  NCY011353     INCYTE 011353                      0
 10298  NCY010298     INCYTE 010298                      0      7   14.000
 10215  HUM4COLA      Collagenase, type IV               0      6   12.000
 10276  NCY010276     INCYTE 010276                      0      6   12.000
 10488  NCY010488     INCYTE 010488                      0      6   12.000
 11138  NCY011138     INCYTE 011138                      0      6   12.000
 10037  HUMCAPPRO     Adenylate cyclase                  1     10   10.000
 10840  HUMADCY       Adenylate cyclase                  0      5   10.000
 10672  HSCD44E       Cell adhesion glptn                0      5   10.000
 12837  HUMCYCLOX     Cyclooxygenase-2                   0      5   10.000
 10001  NCY010001     INCYTE 010001                      0      5   10.000
 10005  NCY010005     INCYTE 010005                      0      5   10.000
 10294  NCY010294     INCYTE 010294                      0      5   10.000
 10297  NCY010297     INCYTE 010297                      0      5   10.000
 10403  NCY010403     INCYTE 010403                      0      5   10.000
 10699  NCY010699     INCYTE 010699                      0      5   10.000
 10966  NCY010966     INCYTE 010966                      0      5   10.000
 12092  NCY012092     INCYTE 012092                      0      5   10.000
 12549  HSRHOB        Oncogene rho                       0      5   10.000
 10691  HUMARF1BA     ADP-ribosylation fctr              0      4    8.000
 12106  HSADSS        Adenylosuccinate synthetase        0      4    8.000
 10194  HSCATHL       Cathepsin L                        0      4    8.000
 10479  CLMCYCA    I  Cyclin A                           0      4    8.000
 10031  NCY010031     INCYTE 010031                      0      4    8.000
 10203  NCY010203     INCYTE 010203                      0      4    8.000
 10288  NCY010288     INCYTE 010288                      0      4    8.000
 10372  NCY010372     INCYTE 010372                      0      4    8.000
 10471  NCY010471     INCYTE 010471                      0      4    8.000
 10484  NCY010484     INCYTE 010484                      0      4    8.000
 10859  NCY010859     INCYTE 010859                      0      4    8.000
 10890  NCY010890     INCYTE 010890                      0      4    8.000
 11511  NCY011511     INCYTE 011511                      0      4    8.000
 11868  NCY011868     INCYTE 011868                      0      4    8.000
 12820  NCY012820     INCYTE 012820                      0      4    8.000
 10133  HSI1RAP       IL-1 antagonist                    0      4    8.000
 10516  HUMP2A        Phosphatase, regul 2A              0      4    8.000
 11063  HUMB94        TNF-induc response                 0      4    8.000
 11140  HSHB15RNA     HB15 gene; new Ig                  0      3    6.000
 10788  NCY001713     INCYTE 001713                      0      3    6.000
 10033  NCY010033     INCYTE 010033                      0      3    6.000
 10035  NCY010035     INCYTE 010035                      0      3    6.000
 10084  NCY010084     INCYTE 010084                      0      3    6.000
 10236  NCY010236     INCYTE 010236                      0      3    6.000
 10383  NCY010383     INCYTE 010383                      0      3    6.000
 10450  NCY010450     INCYTE 010450                      0      3    6.000
 10470  NCY010470     INCYTE 010470                      0      3    6.000
 10504  NCY010504     INCYTE 010504                      0      3    6.000
 10507  NCY010507     INCYTE 010507                      0      3    6.000
 10598  NCY010598     INCYTE 010598                      0      3    6.000
 10779  NCY010779     INCYTE 010779                      0      3    6.000
 10909  NCY010909     INCYTE 010909                      0      3    6.000
 10976  NCY010976     INCYTE 010976                      0      3    6.000
 10985  NCY010985     INCYTE 010985                      0      3    6.000
 11052  NCY011052     INCYTE 011052                      0      3    6.000
 11068  NCY011068     INCYTE 011068                      0      3    6.000
 11134  NCY011134     INCYTE 011134                      0      3    6.000
 11136  NCY011136     INCYTE 011136                      0      3    6.000
 11191  NCY011191     INCYTE 011191                      0      3    6.000
 11219  NCY011219     INCYTE 011219                      0      3    6.000
 11386  NCY011386     INCYTE 011386                      0      3    6.000
 11403  NCY011403     INCYTE 011403                      0      3    6.000
 11460  NCY011460     INCYTE 011460                      0      3    6.000
 11618  NCY011618     INCYTE 011618                      0      3    6.000
 11686  NCY011686     INCYTE 011686                      0      3    6.000
 12021  NCY012021     INCYTE 012021                      0      3    6.000
 12025  NCY012025     INCYTE 012025                      0      3    6.000
 12320  NCY012320     INCYTE 012320                      0      3    6.000
 12330  NCY012330     INCYTE 012330                      0      3    6.000
 12853  NCY012853     INCYTE 012853                      0      3    6.000
 14386  NCY014386     INCYTE 014386                      0      3    6.000
 14391  NCY014391     INCYTE 014391                      0      3    6.000






TABLE 5 



* Masber maxiu for BUSTHACTIOtY output 

SBT SAPBTlf OFF 
5ZP EXACT CN . 
GET TytGAKSAD TO 0 
a£AR • • 

SET DE^^CE TO SGREEt? 

USB-"SmartGuy:Fo:cBASE+/Uac:£cx £ilesi Clones. db£" ■ ' ■ 

qo TOP ' 

$TD{^ ITDMBStl TO INITIATE 

GO Borrcu 

STORB ' TO •cargetl 

STORE! * ' TO Targets 

6T0REI * .'TO Targeta 

STORE.* ' TO Objectl 

BTOftE 'TO Object2 

STORB ' • 70 Object3 

STORE 0 TO ANAL ' 
STOSIE 0 TO ESiATCH 
STORE 0 TO HMATCH . 
STORE 0 TO GMATCH 
STORE 0 TO IMATCH 
STORE 0 TO JTP 
SICRB 1 TO BAUr 
DO mZLE .T. ' 

* 'Frogreon. 1 'Subtraction 2.£n[t 
Date,,,,t.ao/Xi/94 . 

• * Version.! Fos<BASE4>/Hac, revlsloa 1.10 

* Kotes....; Foroat file Subtract'ioA 2 
.* * • 

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0,
@ PIXELS 75,120 TO 178,241 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 27,154 SAY "Subtraction Menu" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 117,126 GET EMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Exact " SIZE 15,62 CO
@ PIXELS 135,126 GET HMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Homologous" SIZE 15,1
@ PIXELS 153,126 GET OMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Other spc" SIZE 15,84
@ PIXELS 90,152 SAY "Matches:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 171,126 GET Imatch STYLE 65536 FONT "Chicago",12 PICTURE "@*C Incyte" SIZE 15,65 CO
@ PIXELS 252,137 GET initiate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 252,236 GET terminate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 252,35 SAY "Include clones " STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 252,215 SAY "->" STYLE 65536 FONT "Geneva",14 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 198,126 GET PTF STYLE 65536 FONT "Chicago",12 PICTURE "@*C Print to file" SIZE 15,9
@ PIXELS 90,9 TO 181,109 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 90,288 TO 181,397 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 81,296 SAY "Background:" STYLE 65536 FONT "Geneva",270 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 45,135 GET ANAL STYLE 65536 FONT "Chicago",12 PICTURE "@*R Overall;Function" SIZE 4
@ PIXELS 81,16 SAY "Target:" STYLE 65536 FONT "Geneva",270 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 108,20 GET target1 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 135,20 GET target2 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 162,20 GET target3 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 108,299 GET object1 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 135,299 GET object2 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 162,299 GET object3 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 276,324 GET Bail STYLE 65536 FONT "Chicago",12 PICTURE "@*R Run;Bail out" SIZE 4112

* EOF: Subtraction 2.fmt
READ
IF Bail=2
  CLEAR
  CLOSE DATABASES
  USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
  SET SAFETY ON
  SCREEN 1 OFF
  RETURN






ENDIF
STORE VAL(SYS(2)) TO STARTIME
STORE UPPER(Target1) TO Target1
STORE UPPER(Target2) TO Target2
STORE UPPER(Target3) TO Target3
STORE UPPER(Object1) TO Object1
STORE UPPER(Object2) TO Object2
STORE UPPER(Object3) TO Object3
CLEAR
SET TALK ON
GAP = TERMINATE-INITIATE+1
GO INITIATE
COPY NEXT GAP FIELDS NUMBER,library,D,F,Z,R,ENTRY,S,DESCRIPTOR,START,RFEND,I TO TEMPNUM
USE TEMPNUM
COUNT TO TOT
COPY TO TEMPRED FOR D='E'.OR.D='O'.OR.D='H'.OR.D='N'.OR.D='I'
USE TEMPRED
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0
  COPY TO TEMPDESIG
ELSE
  COPY STRUCTURE TO TEMPDESIG
  USE TEMPDESIG
  IF Ematch=1
    APPEND FROM TEMPNUM FOR D='E'
  ENDIF
  IF Hmatch=1
    APPEND FROM TEMPNUM FOR D='H'
  ENDIF
  IF Omatch=1
    APPEND FROM TEMPNUM FOR D='O'
  ENDIF
  IF Imatch=1
    APPEND FROM TEMPNUM FOR D='I'.OR.D='X'.OR.D='N'
  ENDIF
ENDIF
COUNT TO STARTOT

COPY STRUCTURE TO TEMPLIB
USE TEMPLIB
APPEND FROM TEMPDESIG FOR library=UPPER(target1)
IF target2<>' '
  APPEND FROM TEMPDESIG FOR library=UPPER(target2)
ENDIF
IF target3<>' '
  APPEND FROM TEMPDESIG FOR library=UPPER(target3)
ENDIF
COUNT TO ANALTOT
USE TEMPDESIG
COPY STRUCTURE TO TEMPSUB
USE TEMPSUB
APPEND FROM TEMPDESIG FOR library=UPPER(Object1)
IF target2<>' '
  APPEND FROM TEMPDESIG FOR library=UPPER(Object2)
ENDIF
IF target3<>' '
  APPEND FROM TEMPDESIG FOR library=UPPER(Object3)
ENDIF
COUNT TO SUBTRACTOT
SET TALK OFF



* COMPRESSION SUBROUTINE A
? 'COMPRESSING QUERY LIBRARY'
USE TEMPLIB






SORT ON ENTRY,NUMBER TO LIBSORT
USE LIBSORT
COUNT TO IDGENE
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2 = 0
DO WHILE SW2=0 ROLL
  IF MARK1 >= IDGENE
    PACK
    COUNT TO AUNIQUE
    SW2 = 1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  STORE D TO DESIGA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    STORE D TO DESIGB
    IF TESTA = TESTB.AND.DESIGA=DESIGB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW = 1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
SORT ON RFEND/D,NUMBER TO TEMPTARSORT
USE TEMPTARSORT
*REPLACE ALL START WITH RFEND/IDGENE*10000
COUNT TO TEMPTARCO

* COMPRESSION SUBROUTINE B
? 'COMPRESSING TARGET LIBRARY'
USE TEMPSUB
SORT ON ENTRY,NUMBER TO SUBSORT
USE SUBSORT
COUNT TO SUBGENE
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2 = 0
DO WHILE SW2=0 ROLL
  IF MARK1 >= SUBGENE
    PACK
    COUNT TO BUNIQUE
    SW2 = 1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  STORE D TO DESIGA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    STORE D TO DESIGB
    IF TESTA = TESTB.AND.DESIGA=DESIGB






      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW = 1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
SORT ON RFEND/D,NUMBER TO TEMPSUBSORT
USE TEMPSUBSORT
*REPLACE ALL START WITH RFEND/IDGENE*10000
COUNT TO TEMPSUBCO



* FUSION ROUTINE
? 'SUBTRACTING LIBRARIES'
USE SUBTRACTION
COPY STRUCTURE TO CRUNCHER
SELECT 2
USE TEMPSUBSORT
SELECT 1
USE CRUNCHER
APPEND FROM TEMPTARSORT
COUNT TO BAILOUT
MARK = 0
DO WHILE .T.
  SELECT 1
  MARK = MARK+1
  IF MARK>BAILOUT
    EXIT
  ENDIF
  GO MARK
  STORE ENTRY TO SCANNER
  SELECT 2
  LOCATE FOR ENTRY=SCANNER
  IF FOUND()
    STORE RFEND TO BIT1
    STORE RFEND TO BIT2
  ELSE
    STORE 1/2 TO BIT1
    STORE 0 TO BIT2
  ENDIF
  SELECT 1
  REPLACE BGFREQ WITH BIT2
  REPLACE ACTUAL WITH BIT1
  LOOP
ENDDO
SELECT 1
REPLACE ALL RATIO WITH RFEND/ACTUAL
? 'DOING FINAL SORT BY RATIO'
SORT ON RATIO/D,BGFREQ/D,DESCRIPTOR TO FINAL
USE FINAL
SET TALK OFF

DO CASE
CASE PTF=0
  SET DEVICE TO PRINT
  SET PRINT ON
  EJECT
CASE PTF=1
  SET ALTERNATE TO "Adenoid:Patent Figures:Subtraction.txt"
  SET ALTERNATE ON
ENDCASE






STORE VAL(SYS(2)) TO FINTIME
IF FINTIME<STARTIME

ENDIF
STORE FINTIME-STARTIME TO COMPSEC
STORE COMPSEC/60 TO COMPMIN
SET MARGIN TO 10
@1,1 SAY "Library Subtraction Analysis" STYLE 65536 FONT "Geneva",274 COLOR 0,0,0,-1,
?
?
?
?
? date()
?? TIME()
? 'Clone numbers '
?? STR(INITIATE,5,0)
?? ' through '
?? STR(TERMINATE,6,0)
? 'Libraries: '
? Target1
IF target2<>' '
  ?? ', '
  ?? Target2
ENDIF
IF target3<>' '
  ?? ', '
  ?? Target3
ENDIF
? 'Subtracting: '
? Object1
IF Object2<>' '
  ?? ', '
  ?? Object2
ENDIF
IF Object3<>' '
  ?? ', '
  ?? Object3
ENDIF
? 'Designations: '
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0
  ?? 'All'
ENDIF
IF Ematch=1
  ?? 'Exact,'
ENDIF
IF Hmatch=1
  ?? 'Human,'
ENDIF
IF Omatch=1
  ?? 'Other sp.'
ENDIF
IF Imatch=1
  ?? 'INCYTE'
ENDIF
IF ANAL=1
  ? 'Sorted by ABUNDANCE'
ENDIF
IF ANAL=2
  ? 'Arranged by FUNCTION'
ENDIF






? 'Total clones represented: '
?? STR(TOT,5,0)
? 'Total clones analyzed: '

? 'Total computation time: '
?? STR(COMPMIN,5,2)
?? ' minutes'
? ' '
? 'd = designation  f = distribution  z = location  r = function  s = species  i = inte
? ' '



SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0,
DO CASE
CASE ANAL=1
?? STR(AUNIQUE,4,0)
?? ' genes, for a total of '
?? STR(ANALTOT,4,0)
?? ' clones'
? ' '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I
SET PRINT OFF
CLOSE DATABASES
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"

CASE ANAL=2
* arrange/function
SET PRINT ON
SET HEADING ON
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
?
? ' BINDING PROTEINS'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Surface molecules and receptors: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='B'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Calcium-binding proteins: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='C'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Ligands and effectors: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='S'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Other binding proteins: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='I'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
? ' ONCOGENES'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'General oncogenes: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='O'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'GTP-binding proteins: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='G'
?






SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Viral elements: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='V'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Kinases and Phosphatases: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='Y'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Tumor-related antigens: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='A'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
? ' PROTEIN SYNTHETIC MACHINERY PROTEINS'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Transcription and Nucleic Acid-binding proteins: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='D'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Translation: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='T'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Ribosomal proteins: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='R'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Protein processing: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='a'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
?
? ' ENZYMES'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Ferroproteins: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='F'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Proteases and inhibitors: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='P'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Oxidative phosphorylation: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='Z'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Sugar metabolism: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='Q'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Amino acid metabolism: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,






list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='M'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Nucleic acid metabolism: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='N'
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Lipid metabolism: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='W'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Other enzymes: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='E'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
? ' MISCELLANEOUS CATEGORIES'
?
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Stress response: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='H'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Structural: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='K'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Other clones: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='X'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Clones of unknown function: '
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='U'
ENDCASE
DO "Test print.prg"
SET PRINT OFF
SET DEVICE TO SCREEN
CLOSE DATABASES
ERASE TEMPLIB.DBF
ERASE TEMPNUM.DBF
ERASE TEMPDESIG.DBF
SET MARGIN TO 0
CLEAR
LOOP
ENDDO
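
The subtraction logic of the listing above (compress each library to per-entry clone counts, look each target entry up in the background, then rank by ratio) can be sketched outside FoxBASE+. The Python below is only an illustrative sketch, not the listing's code: the function names and the dictionary layout are hypothetical, entries are keyed by entry alone (the listing also compares the designation field D when compressing), and ties are broken by entry rather than by DESCRIPTOR.

```python
from collections import Counter


def compress(entries):
    """Collapse repeated entries into per-gene clone counts (the RFEND column)."""
    return Counter(entries)


def subtract(target_entries, background_entries):
    """Rank target-library genes by abundance ratio against a background library."""
    target = compress(target_entries)
    background = compress(background_entries)
    rows = []
    for entry, rfend in target.items():
        bgfreq = background.get(entry, 0)
        # As in the listing, a gene absent from the background gets a
        # divisor of 1/2 instead of 0, so the ratio stays finite.
        actual = bgfreq if bgfreq > 0 else 0.5
        rows.append({"entry": entry, "rfend": rfend,
                     "bgfreq": bgfreq, "ratio": rfend / actual})
    # Descending ratio, then descending background count; entry breaks ties
    # (an assumption standing in for the DESCRIPTOR sort key).
    rows.sort(key=lambda r: (-r["ratio"], -r["bgfreq"], r["entry"]))
    return rows


if __name__ == "__main__":
    for row in subtract(["A", "A", "A", "B", "C", "C"],
                        ["A", "C", "C", "C", "C"]):
        print(row)
```

Genes that appear only in the target library keep a finite ratio because of the 1/2 divisor, so they rank by their own clone count alongside genes that merely outnumber the background.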






* Northern (single), version 11-25-94
close databases
SET TALK OFF
SET PRINT OFF
SET EXACT OFF
STORE '         ' TO Eobject
STORE '         ' TO Dobject
STORE 0 TO Numb
STORE 0 TO Zog
STORE 1 TO Bail
DO WHILE .T.
* Program.: Northern (single).fmt
* Date....: 8/ 8/94
* Version.: FoxBASE+/Mac, revision 1.10
* Notes....: Format file Northern (single)


SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",12 COLOR 0,0,0
@ PIXELS 15,81 TO 46,397 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 89,79 TO 182,422 STYLE 28447 COLOR 0,0,0,-25600,-1,-1

1 lil''.'..'^'^.ts'i^.^.d'VJ^^^ 



«• riZ rr.^Ii- «Aiiao gjgjo ruNT 'ueneva"., 12 COLOR 0.0 0 -1 -i -i 

I ll^ilV.^ ^ 0 ^ -GenevaM^ SIZE ^S^O^cSoR o;o:o -1 -l' -1 

e PIXELS 80,152 SAY -Enter any ONE of the folloidng,- BT«ij essiT^oS^^ coLOR -i; 

* EOF: Northern (single).fmt
READ
IF Bail=2
  CLEAR
  screen 1 off
  RETURN
ENDIF
USE "SmartGuy:FoxBASE+/Mac:Fox files:Lookup.dbf"
SET TALK ON
*
IF Eobject<>' '
  STORE UPPER(Eobject) to Eobject
  SET SAFETY OFF
  SORT ON Entry TO "Lookup entry.dbf"
  SET SAFETY ON
  USE "Lookup entry.dbf"
  LOCATE FOR Look=Eobject
  IF .NOT.FOUND()
    LOOP
  ENDIF
  BROWSE
  STORE Entry TO Searchval
  CLOSE DATABASES
  ERASE "Lookup entry.dbf"
ENDIF
IF Dobject<>' '
  SET EXACT OFF
  SET SAFETY OFF
  SORT ON descriptor TO "Lookup descriptor.dbf"
  SET SAFETY ON
  USE "Lookup descriptor.dbf"
  LOCATE FOR UPPER(TRIM(descriptor))=UPPER(TRIM(Dobject))
  IF .NOT.FOUND()
    CLEAR






    LOOP
  ENDIF
  BROWSE
  STORE Entry TO Searchval
  CLOSE DATABASES
  ERASE "Lookup descriptor.dbf"
  SET EXACT ON
ENDIF
IF Numb<>0
  USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
  GO Numb
  BROWSE
  STORE Entry TO Searchval
ENDIF
*
? 'Northern analysis for entry '
?? Searchval
? 'Enter Y to proceed'
WAIT TO OK
CLEAR
IF UPPER(OK)<>'Y'
  screen 1 off
  RETURN
ENDIF

*
* COMPRESSION SUBROUTINE FOR Library.dbf
? 'Compressing the Libraries file now...'
USE "SmartGuy:FoxBASE+/Mac:Fox files:libraries.dbf"
SORT ON library TO "Compressed libraries.dbf"
* FOR entered>0
SET SAFETY ON
USE "Compressed libraries.dbf"
DELETE FOR entered=0
PACK
COUNT TO TOT
MARK1 = 1
SW2 = 0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    SW2 = 1
    LOOP
  ENDIF
  GO MARK1
  STORE library TO TESTA
  SKIP
  STORE library TO TESTB
  IF TESTA = TESTB
    DELETE
  ENDIF
  MARK1 = MARK1+1
  LOOP
ENDDO ROLL

*
* Northern analysis
CLEAR
? 'Doing the northern now...'
SET TALK ON
USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
SET SAFETY OFF
COPY TO "Hits.dbf" FOR entry=Searchval
SET SAFETY ON
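
The listing breaks off after the matching clones are copied to Hits.dbf. As a rough illustration only, the selection step and one plausible follow-up (a per-library tally of the hits) might look like the Python below. The tally is an assumption, since the remainder of the program is not reproduced here; the function name and record layout are hypothetical, and the comparison is an exact match, whereas the FoxBASE+ `=` operator can match on a prefix when SET EXACT is OFF.

```python
from collections import Counter


def northern(clones, searchval):
    """Select clones whose entry equals searchval and tally them by library."""
    # Mirrors COPY TO "Hits.dbf" FOR entry=Searchval (exact match assumed).
    hits = [c for c in clones if c["entry"] == searchval]
    # Assumed follow-up step: count how many hits each library contributes.
    per_library = Counter(c["library"] for c in hits)
    return hits, per_library


if __name__ == "__main__":
    clones = [
        {"entry": "E1", "library": "L1"},
        {"entry": "E1", "library": "L1"},
        {"entry": "E2", "library": "L1"},
        {"entry": "E1", "library": "L2"},
    ]
    hits, per_library = northern(clones, "E1")
    print(len(hits), dict(per_library))
```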






* MASTER ANALYSIS 3; VERSION 12-9-94
* Master menu for analysis output
CLOSE DATABASES
SET TALK OFF
SET SAFETY OFF
CLEAR
SET DEVICE TO SCREEN
SET DEFAULT TO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:"
USE "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf"
GO TOP
STORE NUMBER TO INITIATE
GO BOTTOM
STORE NUMBER TO TERMINATE
STORE 0 TO ENTIRE
STORE 0 TO CONDEN
STORE 0 TO ANAL
STORE 0 TO EMATCH
STORE 0 TO HMATCH
STORE 0 TO OMATCH
STORE 0 TO IMATCH
STORE 0 TO XMATCH
STORE 0 TO PRINTON
STORE 0 TO PTF
DO WHILE .T.

* Program.: Master analysis.fmt
* Date....: 12/ 9/94
* Version.: FoxBASE+/Mac, revision 1.10
* Notes....: Format file Master analysis

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0,
@ PIXELS 39,255 TO 277,430 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 75,120 TO 178,241 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 27,98 SAY "Customized Output Menu" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-1,-1,-1
@ PIXELS 45,54 GET conden STYLE 65536 FONT "Chicago",12 PICTURE "@*C Condensed format" SIZE
@ PIXELS 54,261 GET anal STYLE 65536 FONT "Chicago",12 PICTURE "@*RV Sort/number;Sort/entry;
@ PIXELS 117,126 GET EMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Exact " SIZE 15,62 CO
@ PIXELS 135,126 GET HMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Homologous" SIZE 15,1
@ PIXELS 153,126 GET OMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Other spc" SIZE 15,84
@ PIXELS 90,152 SAY "Matches:" STYLE 65536 FONT "Geneva",268 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 63,54 GET PRINTON STYLE 65536 FONT "Chicago",12 PICTURE "@*C Include clone listing"
@ PIXELS 171,126 GET Imatch STYLE 65536 FONT "Chicago",12 PICTURE "@*C Incyte" SIZE 15,65 CO
@ PIXELS 252,146 GET initiate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 270,146 GET terminate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 234,134 SAY "Include clones " STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1
@ PIXELS 270,125 SAY "->" STYLE 65536 FONT "Geneva",14 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 198,126 GET PTF STYLE 65536 FONT "Chicago",12 PICTURE "@*C Print to file" SIZE 15,9
@ PIXELS 189,0 TO 257,120 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 209,8 SAY "Library selection" STYLE 65536 FONT "Geneva",266 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 227,18 GET ENTIRE STYLE 65536 FONT "Chicago",12 PICTURE "@*RV All;Selected" SIZE 16
* EOF: Master analysis.fmt
READ

IF ANAL=9
  CLEAR
  CLOSE DATABASES
  ERASE TEMPMASTER.DBF
  USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
  SET SAFETY ON
  SCREEN 1 OFF
  RETURN
ENDIF
CLEAR
? INITIATE
? TERMINATE
? CONDEN
? ANAL




? Ematch
? Hmatch
? Omatch
? IMATCH
SET TALK ON
IF ENTIRE=2
  USE "Unique libraries.dbf"
  REPLACE ALL i WITH ' '
  BROWSE FIELDS i,libname,library,total,entered AT 0,6
ENDIF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
COPY TO TEMPNUM FOR NUMBER>=INITIATE.AND.NUMBER<=TERMINATE
*USE TEMPNUM
COPY STRUCTURE TO TEMPLIB
USE TEMPLIB
IF ENTIRE=1
  APPEND FROM "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf"
ENDIF
IF ENTIRE=2
  USE "Unique libraries.dbf"
  COPY TO SELECTED FOR UPPER(i)='Y'
  USE SELECTED
  STORE RECCOUNT() TO STOPIT
  MARK=1
  DO WHILE .T.
    IF MARK>STOPIT
      CLEAR
      EXIT
    ENDIF
    USE SELECTED
    GO MARK
    STORE library TO THISONE
    ? 'COPYING '
    ?? THISONE
    USE TEMPLIB
    APPEND FROM "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf" FOR library=THISONE
    STORE MARK+1 TO MARK
    LOOP
  ENDDO
ENDIF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
COUNT TO STARTOT
COPY STRUCTURE TO TEMPDESIG
USE TEMPDESIG
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0
  APPEND FROM TEMPLIB
ENDIF
IF Ematch=1
  APPEND FROM TEMPLIB FOR D='E'
ENDIF
IF Hmatch=1
  APPEND FROM TEMPLIB FOR D='H'
ENDIF
IF Omatch=1
  APPEND FROM TEMPLIB FOR D='O'
ENDIF
IF Imatch=1
  APPEND FROM TEMPLIB FOR D='I'.OR.D='X'.OR.D='N'
ENDIF
IF Xmatch=1
  APPEND FROM TEMPLIB FOR D='X'
ENDIF
COUNT TO ANALTOT
SET TALK OFF
DO CASE




CASE PTF=0
  SET DEVICE TO PRINT
  SET PRINT ON
  EJECT
CASE PTF=1
  SET ALTERNATE TO "Total function sort.txt"
  *SET ALTERNATE TO "H and O function sort.txt"
  *SET ALTERNATE TO "Shear Stress HUVEC 2:Abundance sort.txt"
  *SET ALTERNATE TO "Shear Stress HUVEC 2:Abundance con.txt"
  *SET ALTERNATE TO "Shear Stress HUVEC 2:Function sort.txt"
  *SET ALTERNATE TO "Shear Stress HUVEC 2:Distribution sort.txt"
  *SET ALTERNATE TO "Shear Stress HUVEC 1:Clone list.txt"
  *SET ALTERNATE TO "Shear Stress HUVEC 2:Location sort.txt"
  SET ALTERNATE ON
ENDCASE
IF PRINTON=1
  @1,30 SAY "Database Subset Analysis" STYLE 65536 FONT "Geneva",274 COLOR 0,0,0,-1,-1,-1
ENDIF

?
?
?
? date()
?? ' '
?? TIME()
? 'Clone numbers '
?? STR(INITIATE,6,0)
?? ' through '
?? STR(TERMINATE,6,0)
? 'Libraries: '
IF ENTIRE=1
  ? 'All libraries'
ENDIF
IF ENTIRE=2
  MARK=1
  DO WHILE .T.
    IF MARK>STOPIT
      EXIT
    ENDIF
    USE SELECTED
    GO MARK
    ? ' '
    ?? TRIM(libname)
    STORE MARK+1 TO MARK
    LOOP
  ENDDO
ENDIF

? 'Designations: '
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0
  ?? 'All'
ENDIF
IF Ematch=1
  ?? 'Exact, '
ENDIF
IF Hmatch=1
  ?? 'Human, '
ENDIF
IF Omatch=1
  ?? 'Other sp.'
ENDIF
IF Imatch=1
  ?? 'INCYTE'
ENDIF
IF Xmatch=1
  ?? 'EST'





ENDIF
IF CONDEN=1
  ? 'Condensed format analysis'
ENDIF
IF ANAL=1
  ? 'Sorted by NUMBER'
ENDIF
IF ANAL=2
  ? 'Sorted by ENTRY'
ENDIF
IF ANAL=3
  ? 'Arranged by ABUNDANCE'
ENDIF
IF ANAL=4
  ? 'Sorted by INTEREST'
ENDIF
IF ANAL=5
  ? 'Arranged by LOCATION'
ENDIF
IF ANAL=6
  ? 'Arranged by DISTRIBUTION'
ENDIF
IF ANAL=7
  ? 'Arranged by FUNCTION'
ENDIF
? 'Total clones represented: '
?? STR(STARTOT,6,0)
? 'Total clones analyzed: '
?? STR(ANALTOT,6,0)
?
? 'l = library  d = designation  f = distribution  z = location  r = function  c = cer
?

USE TEMPDESIG
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
DO CASE
CASE ANAL=1
* sort/number
SET HEADING ON
IF CONDEN=1
  SORT TO TEMP1 ON ENTRY,NUMBER
  DO "COMPRESSION number.PRG"
ELSE
  SORT TO TEMP1 ON NUMBER
  USE TEMP1
  list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR
  *list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RFEND,INIT,I
  CLOSE DATABASES
  ERASE TEMP1.DBF
ENDIF
CASE ANAL=2
* sort/DESCRIPTOR
SET HEADING ON
*SORT TO TEMP1 ON DESCRIPTOR,ENTRY,NUMBER/S for D='E'.OR.D='H'.OR.D='O'.OR.D='X'.OR.D='I'
*SORT TO TEMP1 ON ENTRY,DESCRIPTOR,NUMBER/S for D='E'.OR.D='H'.OR.D='O'.OR.D='X'.OR.D='I'
SORT TO TEMP1 ON ENTRY,START/S for D='E'.OR.D='H'.OR.D='O'.OR.D='X'.OR.D='I'
IF CONDEN=1
  DO "COMPRESSION entry.PRG"
ELSE
  USE TEMP1
  list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RFEND,INIT,I
  CLOSE DATABASES
  ERASE TEMP1.DBF
ENDIF






CASE ANAL=3
* sort by abundance
SET HEADING ON
SORT TO TEMP1 ON ENTRY,NUMBER for D='E'.OR.D='H'.OR.D='O'.OR.D='X'.OR.D='I'
DO "COMPRESSION abundance.PRG"
CASE ANAL=4
* sort/interest
SET HEADING ON
IF CONDEN=1
  SORT TO TEMP1 ON ENTRY,NUMBER FOR I>0
  DO "COMPRESSION interest.PRG"
ELSE
  SORT ON I/D,ENTRY TO TEMP1 FOR I>1
  USE TEMP1
  list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RF
  close databases
  erase temp1.dbf
ENDIF

CASE ANAL=5
* arrange/location
SET HEADING ON
STORE 4 TO AMPLIFIER
? 'Nuclear: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Cytoplasmic: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Cytoskeleton: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Cell surface: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Intracellular membrane: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Mitochondrial: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF






? 'Secreted: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Other: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Unknown: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
IF CONDEN=1
  DO "Compression location.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
IF CONDEN=1
  SET DEVICE TO PRINTER
  SET PRINTER ON
  EJECT
  DO "Output heading.prg"
  USE "Analysis location.dbf"
  DO "Create bargraph.prg"
  SET HEADING OFF
  ? ' FUNCTIONAL CLASS TOTAL UNIQUE NEW % TOTAL'
  *
  LIST OFF FIELDS Z,NAME,CLONES,GENES,NEW,PERCENT,GRAPH
  CLOSE DATABASES
  ERASE TEMP2.DBF
  SET HEADING ON
  *USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF

CASE ANAL=6
* arrange/distribution
SET HEADING ON
STORE 3 TO AMPLIFIER
? 'Cell/tissue specific distribution: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression distrib.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Non-specific distribution: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression distrib.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Unknown distribution: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,
IF CONDEN=1
  DO "Compression distrib.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
IF CONDEN=1
  SET DEVICE TO PRINTER
  SET PRINTER ON






  EJECT
  DO "Output heading.prg"
  USE "Analysis distribution.dbf"
  DO "Create bargraph.prg"
  SET HEADING OFF
  ? ' FUNCTIONAL CLASS TOTAL UNIQUE % TOTAL'
  ? ' '
  LIST OFF FIELDS F,NAME,CLONES,GENES,PERCENT,GRAPH
  CLOSE DATABASES
  ERASE TEMP2.DBF
  SET HEADING ON
  *USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF

CASE ANAL=7
* arrange/function
SET HEADING ON
STORE 10 TO AMPLIFIER
? ' BINDING PROTEINS'
?
? 'Surface molecules and receptors: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,C
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Calcium-binding proteins: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,CO
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Ligands and effectors: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Other binding proteins: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COM
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
*EJECT
? ' ONCOGENES'
?
? 'General oncogenes: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'GTP-binding proteins: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COM
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Viral elements: '






SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Kinases and Phosphatases: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Tumor-related antigens: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COM
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
*EJECT
? ' PROTEIN SYNTHETIC MACHINERY PROTEINS'
?
? 'Transcription and Nucleic Acid-binding proteins: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Translation: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COM
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Ribosomal proteins: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,CO
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Protein processing: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COM
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
*EJECT
? ' ENZYMES'
?
? 'Ferroproteins: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Proteases and inhibitors: '
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression function.prg"






ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Oxidative phosphorylation:'
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Sugar metabolism:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,C
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Amino acid metabolism:'
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Nucleic acid metabolism:'
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Lipid metabolism:'
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Other enzymes:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
*EJECT
? ' MISCELLANEOUS CATEGORIES'
? 'Stress response:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Structural:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,L
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
? 'Other clones:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
  DO "Compression function.prg"
ELSE






  DO "Normal subroutine 1"
ENDIF
? 'Clones of unknown function:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
IF CONDEN=1
  DO "Compression function.prg"
ELSE
  DO "Normal subroutine 1"
ENDIF
IF CONDEN=1
  EJECT
  *SET DEVICE TO PRINTER
  *SET PRINT ON
  DO "Output heading.prg"
  USE "Analysis function.dbf"
  DO "Create bargraph.prg"
  SET HEADING OFF
  SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",12 COLOR 0,0,0
  ? ' TOTAL TOTAL NEW DIST
  ? ' FUNCTIONAL CLASS CLONES GENES GENES FUNCTIONAL CLASS'
  ?
  *LIST OFF FIELDS P,NAME,CLONES,GENES,NEW,PERCENT,GRAPH,COMPANY
  LIST OFF FIELDS P,NAME,CLONES,GENES,NEW,PERCENT,GRAPH
  CLOSE DATABASES
  ERASE TEMP2.DBF
  SET HEADING ON
  *USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF
CASE ANAL=8
  DO "Subgroup summary S.prg"
ENDCASE
DO "Test print.prg"
SET PRINT OFF
SET DEVICE TO SCREEN
CLOSE DATABASES
*ERASE TEMPLIB.DBF
*ERASE TEMPNUM.DBF
*ERASE TEMPDESIG.DBF
*ERASE SELECTED.DBF
CLEAR
LOOP






* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    COUNT TO UNIQUE
    COUNT TO NEWGENES FOR D='H'.OR.D='O'
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    IF TESTA = TESTB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW=1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
GO TOP
STORE Z TO LOC
USE "Analysis location.dbf"
LOCATE FOR Z=LOC
REPLACE CLONES WITH TOT
REPLACE GENES WITH UNIQUE
REPLACE NEW WITH NEWGENES
USE TEMP1
SORT ON RFEND/D TO TEMP2
USE TEMP2
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LE
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE TEMPDESIG
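
The compression subroutine above walks a work file, collapses consecutive records that share an ENTRY into one record, and stores the number of collapsed clones in RFEND. The following Python fragment is a minimal sketch of that idea, not the listing itself: the dict keys stand in for the work-file fields, the entry name EXAMPLE1 is made up, and reading the designation codes for "new" genes as 'H' and 'O' is an assumption. Unlike the listing, which compares neighbouring records and so relies on the file being sorted by entry, the sketch groups equal entries wherever they occur.

```python
from collections import Counter


def compress(records):
    """Collapse clone records that share an ENTRY into one row per gene.

    `records` is a list of dicts with 'ENTRY' and 'D' keys; these are
    hypothetical stand-ins for rows of the TEMP1 work file.
    """
    tot = len(records)                                   # COUNT TO TOT
    counts = Counter(r["ENTRY"] for r in records)        # clones per entry
    first_seen = {}
    for r in records:                                    # keep the first record of each entry
        first_seen.setdefault(r["ENTRY"], r)
    genes = [dict(rec, RFEND=counts[entry]) for entry, rec in first_seen.items()]
    unique = len(genes)                                  # COUNT TO UNIQUE
    # Assumption: 'H' and 'O' are the designation codes counted as new genes.
    newgenes = sum(1 for g in genes if g["D"] in ("H", "O"))
    genes.sort(key=lambda g: g["RFEND"], reverse=True)   # SORT ON RFEND/D
    return genes, tot, unique, newgenes


clones = [
    {"ENTRY": "HUMEF1B", "D": "E"},
    {"ENTRY": "HUMEF1B", "D": "E"},
    {"ENTRY": "EXAMPLE1", "D": "H"},   # made-up entry name
]
genes, tot, unique, newgenes = compress(clones)
```

Here `genes` holds one row per distinct entry, most abundant first, which corresponds to the "genes, for a total of ... clones" report printed by the subroutine.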






* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    COUNT TO UNIQUE
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    IF TESTA = TESTB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW=1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
*BROWSE
*SET PRINTER ON
SORT ON DATE TO TEMP2
USE TEMP2
?? STR(UNIQUE,4,0)
?? ' genes, for a total of'
?? STR(TOT,4,0)
?? ' clones'
?
? ' V Coincidence'
COUNT TO P4 FOR I=4
IF P4>0
  ? STR(P4,3,0)
  ?? ' genes with priority = 4 (Secondary analysis:)'
  list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=4
  ?
ENDIF
COUNT TO P3 FOR I=3
IF P3>0
  ? STR(P3,3,0)
  ?? ' genes with priority = 3 (Full insert sequence:)'
  list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=3
  ?
ENDIF
COUNT TO P2 FOR I=2
IF P2>0
  ? STR(P2,3,0)
  ?? ' genes with priority = 2 (Primary analysis complete:)'
  list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=2
  ?
ENDIF
COUNT TO P1 FOR I=1
IF P1>0






  ? STR(P1,3,0)
  ?? ' genes with priority = 1 (Primary analysis needed:)'
  list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=1
ENDIF
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"






* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    COUNT TO UNIQUE
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    IF TESTA = TESTB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW=1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
*BROWSE
*SET PRINTER ON
SORT ON NUMBER TO TEMP2
USE TEMP2
?? STR(UNIQUE,4,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"






* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    COUNT TO UNIQUE
    COUNT TO NEWGENES FOR D='H'.OR.D='O'
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    IF TESTA = TESTB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW=1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
GO TOP
STORE R TO FUNC
USE "Analysis function.dbf"
LOCATE FOR P=FUNC
*REPLACE CLONES WITH TOT
REPLACE GENES WITH UNIQUE
REPLACE NEW WITH NEWGENES
USE TEMP1
SORT ON RFEND/D TO TEMP2
USE TEMP2
SET HEADING ON
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
?
*SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",13 COLOR 0,0,
*list off fields RFEND,S,DESCRIPTOR
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE TEMPDESIG






* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    COUNT TO UNIQUE
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    IF TESTA = TESTB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW=1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
GO TOP
STORE P TO DIST
USE "Analysis distribution.dbf"
LOCATE FOR P=DIST
REPLACE CLONES WITH TOT
REPLACE GENES WITH UNIQUE
USE TEMP1
SORT ON RFEND/D TO TEMP2
USE TEMP2
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE TEMPDESIG






* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    COUNT TO UNIQUE
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    IF TESTA = TESTB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW=1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
GO TOP
USE TEMP1
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH
?
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
USE TEMPDESIG






* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
COPY TO TEMP1 FOR
USE TEMP1
DELETE FOR D='N'.OR.D='D'.OR.D='A'.OR.D='U'.OR.D='S'.OR.D='M'.OR.D='R'.OR.D='V'
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    COUNT TO UNIQUE
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    IF TESTA = TESTB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW=1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
*BROWSE
*SET PRINTER ON
SORT ON RFEND/D,NUMBER TO TEMP2
USE TEMP2
REPLACE ALL START WITH RFEND/IDGENE*10000
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' Coincidence V V Clones/10000'
set heading off
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0
list off fields number,RFEND,START,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"






* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO IDGENE FOR D='E'.OR.D='O'.OR.D='H'.OR.D='N'.OR.D='R'.OR.D='A'
DELETE FOR D='N'.OR.D='D'.OR.D='A'.OR.D='U'.OR.D='S'.OR.D='M'.OR.D='R'.OR.D='V'
PACK
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    COUNT TO UNIQUE
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  DUP = 1
  STORE ENTRY TO TESTA
  SW = 0
  DO WHILE SW=0 TEST
    SKIP
    STORE ENTRY TO TESTB
    IF TESTA = TESTB
      DELETE
      DUP = DUP+1
      LOOP
    ENDIF
    GO MARK1
    REPLACE RFEND WITH DUP
    MARK1 = MARK1+DUP
    SW=1
    LOOP
  ENDDO TEST
  LOOP
ENDDO ROLL
*BROWSE
*SET PRINTER ON
SORT ON RFEND/D,NUMBER TO TEMP2
USE TEMP2
REPLACE ALL START WITH RFEND/IDGENE*10000
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' Coincidence V V Clones/10000'
set heading off
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list fields number,RFEND,START,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,INIT,I
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"






USE TEMP1
COUNT TO TOT
?? ' Total of'
?? STR(TOT,4,0)
?? ' clones'
?
*list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR,LENGTH,RFEND,INIT,I
list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR
CLOSE DATABASES
ERASE TEMP1.DBF
USE TEMPDESIG






*Lifeseq menu; version 8-7-94
SET TALK OFF
set device to screen
CLEAR
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
STORE LUPDATE() TO Update
GO BOTTOM
STORE RECNO() TO cloneno
STORE 6 TO Chooser
DO WHILE .T.
* Program.: Lifeseq menu.fmt
* Date....: 1/11/95
* Version.: FoxBASE+/Mac, revision 1.10
* Notes...: Format file Lifeseq menu
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",268 COLOR 0,0,
@ PIXELS 18,126 TO 77,365 STYLE 26479 COLOR 32767,-25600,-1,-16223,-16721,-15725
@ PIXELS 110,29 TO 188,217 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 45,161 SAY "LIFESEQ" STYLE 65536 FONT "Geneva",536 COLOR 0,0,-1,-1,7135,5884
@ PIXELS 36,269 SAY "TO" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,7135,5884
@ PIXELS 63,143 SAY "Molecular Biology Desktop" STYLE 65536 FONT "Helvetica",15 COLOR 0,0,0,
@ PIXELS 90,252 TO 251,467 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 117,270 GET Chooser STYLE 65536 FONT "Chicago",12 PICTURE "@*RV Transcript profiles
@ PIXELS 135,128 SAY Update STYLE 0 FONT "Geneva",12 SIZE 15,79 COLOR 0,0,0,-25600,-1,-1
@ PIXELS 171,128 SAY cloneno STYLE 0 FONT "Geneva",12 SIZE 15,79 COLOR 0,0,0,-25600,-1,-1
@ PIXELS 135,44 SAY "Last update:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 171,44 SAY "Total clones:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 45,296 SAY "v1.30" STYLE 65536 FONT "Geneva",782 COLOR 0,0,-1,-1,-1,-1
* EOF: Lifeseq menu.fmt
READ
DO CASE
CASE Chooser=1
  DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Master analysis 3.prg"
CASE Chooser=2
  DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Subtraction 2.prg"
CASE Chooser=3
  DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Northern (single).prg"
CASE Chooser=4
  USE "Libraries.dbf"
  BROWSE
CASE Chooser=5
  DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:See individual clone.prg"
CASE Chooser=6
  DO "SmartGuy:FoxBASE+/Mac:fox files:Libraries:Output programs:Menu.prg"
CASE Chooser=7
  CLEAR
  SCREEN 1 OFF
  RETURN
ENDCASE
LOOP






@ 1,30 SAY "Database Subset Analysis" STYLE 65536 FONT "Geneva",274 COLOR 0,0,0,-1,-1,-1
?
?
?
? date()
?? ' '
?? TIME()
? 'Clone numbers '
?? STR(INITIATE,6,0)
?? ' through '
?? STR(TERMINATE,6,0)
? 'Libraries: '
IF ENTIRE=1
  ? 'All libraries'
ENDIF
IF ENTIRE=2
  MARK=1
  DO WHILE .T.
    IF MARK>STOPIT
      EXIT
    ENDIF
    USE SELECTED
    GO MARK
    ? ' '
    ?? TRIM(libname)
    STORE MARK+1 TO MARK
    LOOP
  ENDDO
ENDIF
? 'Designations: '
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0
  ?? 'All'
ENDIF
IF Ematch=1
  ?? 'Exact,'
ENDIF
IF Hmatch=1
  ?? 'Human,'
ENDIF
IF Omatch=1
  ?? 'Other sp.'
ENDIF
IF CONDEN=1
  ? 'Condensed format analysis'
ENDIF
IF ANAL=1
  ? 'Sorted by NUMBER'
ENDIF
IF ANAL=2
  ? 'Sorted by ENTRY'
ENDIF
IF ANAL=3
  ? 'Arranged by ABUNDANCE'
ENDIF
IF ANAL=4
  ? 'Sorted by INTEREST'
ENDIF
IF ANAL=5
  ? 'Arranged by LOCATION'
ENDIF
IF ANAL=6
  ? 'Arranged by DISTRIBUTION'
ENDIF
IF ANAL=7
  ? 'Arranged by FUNCTION'






ENDIF
? 'Total clones represented: '
?? STR(STARTOT,6,0)
? 'Total clones analyzed: '
?? STR(ANALTOT,6,0)
?
?






USE TEMP1
COUNT TO TOT
?? ' Total of'
?? STR(TOT,4,0)
?? ' clones'
?
*list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR,LENGTH,R
list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR
CLOSE DATABASES
ERASE TEMP1.DBF
USE TEMPDESIG






USE TEMP1
COUNT TO TOT
?? ' Total of'
?? STR(TOT,4,0)
?? ' clones'
?
*list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR,LENGTH,RFEN
list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR
CLOSE DATABASES
ERASE TEMP1.DBF
USE TEMPDESIG






*Northern (single), version 11-25-94
close databases
SET TALK OFF
SET PRINT OFF
SET EXACT OFF
CLEAR
STORE ' ' TO Eobject
STORE ' ' TO Dobject
STORE 0 TO Numb
STORE 0 TO Zog
STORE 1 TO Bail
DO WHILE .T.
* Program.: Northern (single).fmt
* Date....: 8/ 8/94
* Version.: FoxBASE+/Mac, revision 1.10
* Notes...: Format file Northern (single)
*
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",12 COLOR 0,0,0
@ PIXELS 15,81 TO 46,397 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 89,79 TO 192,422 STYLE 28447 COLOR 0,0,0,-25600,-1,-1
@ PIXELS 115,98 SAY "Entry" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1
@ PIXELS 115,173 GET Eobject STYLE 0 FONT "Geneva",12 SIZE 15,142 COLOR 0,0,0,-1,-1,-1
@ PIXELS 145,89 SAY "Description" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1
@ PIXELS 145,173 GET Dobject STYLE 0 FONT "Geneva",12 SIZE 15,241 COLOR 0,0,0,-1,-1,-1
@ PIXELS 35,89 SAY "Single Northern search screen" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-
@ PIXELS 220,162 GET Bail STYLE 65536 FONT "Chicago",12 PICTURE "@*R Continue;Bail out" SIZE
@ PIXELS 175,98 SAY "Clone #:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1
@ PIXELS 175,173 GET Numb STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,0,-1,-1,-1
@ PIXELS 80,152 SAY "Enter any ONE of the following:" STYLE 65536 FONT "Geneva",12 COLOR -1,
* EOF: Northern (single).fmt
READ
IF Bail=2
  CLEAR
  screen 1 off
  RETURN
ENDIF
USE "SmartGuy:FoxBASE+/Mac:Fox files:Lookup.dbf"
SET TALK ON
IF Eobject<>' '
  STORE UPPER(Eobject) to Eobject
  SET SAFETY OFF
  SORT ON Entry TO "Lookup entry.dbf"
  SET SAFETY ON
  USE "Lookup entry.dbf"
  LOCATE FOR Look=Eobject
  IF .NOT.FOUND()
    CLEAR
    LOOP
  ENDIF
  BROWSE
  STORE Entry TO Searchval
  CLOSE DATABASES
  ERASE "Lookup entry.dbf"
ENDIF
IF Dobject<>' '
  SET EXACT OFF
  SET SAFETY OFF
  SORT ON descriptor TO "Lookup descriptor.dbf"
  SET SAFETY ON
  USE "Lookup descriptor.dbf"
  LOCATE FOR UPPER(TRIM(descriptor))=UPPER(TRIM(Dobject))
  IF .NOT.FOUND()
    CLEAR






    LOOP
  ENDIF
  BROWSE
  STORE Entry TO Searchval
  CLOSE DATABASES
  ERASE "Lookup descriptor.dbf"
  SET EXACT ON
ENDIF
IF Numb<>0
  USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
  GO Numb
  BROWSE
  STORE Entry TO Searchval
ENDIF
CLEAR
? 'Northern analysis for entry '
?? Searchval
?
? 'Enter Y to proceed'
WAIT TO OK
CLEAR
IF UPPER(OK)<>'Y'
  screen 1 off
  RETURN
ENDIF
* COMPRESSION SUBROUTINE FOR Library.dbf
? 'Compressing the Libraries file now...'
USE "SmartGuy:FoxBASE+/Mac:Fox files:libraries.dbf"
SET SAFETY OFF
SORT ON library TO "Compressed libraries.dbf"
* FOR entered>0
SET SAFETY ON
USE "Compressed libraries.dbf"
DELETE FOR entered=0
PACK
COUNT TO TOT
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
  IF MARK1 >= TOT
    PACK
    SW2=1
    LOOP
  ENDIF
  GO MARK1
  STORE library TO TESTA
  SKIP
  STORE library TO TESTB
  IF TESTA = TESTB
    DELETE
  ENDIF
  MARK1 = MARK1+1
  LOOP
ENDDO ROLL
* Northern analysis
CLEAR
? 'Doing the northern now...'
SET TALK ON
USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
SET SAFETY OFF
COPY TO "Hits.dbf" FOR entry=searchval
SET SAFETY ON






CLOSE DATABASES
SELECT 1
USE "Compressed libraries.dbf"
STORE RECCOUNT() TO Entries
SELECT 2
USE "Hits.dbf"
Mark=1
DO WHILE .T.
  SELECT 1
  IF Mark>Entries
    EXIT
  ENDIF
  GO MARK
  STORE library TO Jigger
  SELECT 2
  COUNT TO Zog FOR library=Jigger
  SELECT 1
  REPLACE hits with Zog
  Mark=Mark+1
  LOOP
ENDDO
SELECT 1
BROWSE FIELDS LIBRARY,LIBNAME,ENTERED,HITS AT 0,0
CLEAR
? 'Enter Y to print:'
WAIT TO PRINSET
IF UPPER(PRINSET)='Y'
  SET PRINT ON
  CLEAR
  EJECT
  SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",14 COLOR 0,0,0
  ? 'DATABASE ENTRIES MATCHING ENTRY '
  ?? Searchval
  ? DATE()
  ?
  SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
  LIST OFF FIELDS library,libname,entered,hits
  ?
  ?
  SELECT 2
  LIST OFF FIELDS NUMBER,LIBRARY,D,S,F,Z,R,ENTRY,DESCRIPTOR,RFSTART,START,RFEND
  SET TALK OFF
  SET PRINT OFF
ENDIF
CLOSE DATABASES
SET TALK OFF
CLEAR
DO "Test print.prg"
RETURN
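
The Northern (single) program above copies every clone whose entry matches the search value into a hits file, then counts those hits for each library. The Python fragment below is a minimal sketch of that counting step under stated assumptions: the dict keys mirror the listing's field names, the library codes LIB_A and LIB_B and the sample clones are made up for illustration, and entries are compared for exact equality.

```python
from collections import Counter


def electronic_northern(clones, libraries, searchval):
    """Count, per library, the clones whose entry equals `searchval`."""
    hits = Counter(c["library"] for c in clones if c["entry"] == searchval)
    # One output row per library, including libraries with zero hits.
    return [dict(lib, hits=hits.get(lib["library"], 0)) for lib in libraries]


libraries = [
    {"library": "LIB_A", "libname": "first library"},    # illustrative names
    {"library": "LIB_B", "libname": "second library"},
]
clones = [
    {"number": 1, "library": "LIB_A", "entry": "HUMEF1B"},
    {"number": 2, "library": "LIB_A", "entry": "HUMEF1B"},
    {"number": 3, "library": "LIB_B", "entry": "OTHER"},
]
result = electronic_northern(clones, libraries, "HUMEF1B")
```

Each row of `result` carries the library code, its name, and a hit count, which is the shape of the per-library listing the program prints.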






TABLE 6

library
ADENINB01
ADRENOR01
ADRENOT01
AMLBNOT01
BMARNOT01
BMARNOT02
CARDNOT01
CHAONOT01
CORNNOT01
FIBRAGT02
FIBRANT01
FIBRNGT0
FIBRNOT01
HMC1NOT01
HUVELPB01
HUVENOB01
HUVESTB01
HYPONOB01
KIDNNOT01
LIVRNOT01
LUNGNOT01
MUSCNOT01
OVIDNOB01
PANCNOT01
PITUNOR01
PITUNOT01
PLACNOB01
SPLNFET01
SPLNNOT02
STOMNOT01
SYNORAB01
TBLYNOT01
TESTNOT01
THP1NOB01
THP1PEB01
THP1PLB01
U937NOT01

libname
Inflamed adenoid
Adrenal gland (r)
Adrenal gland (T)
AML blast cells (T)
Bone marrow
Bone marrow (T)
Cardiac muscle (V)
Chin. hamster ovary
Corneal stroma
Fibroblast, AT 5
Fibroblast, AT 30
Fibroblast AT
Fibroblast, uv 5
Fibroblast, uv 30
Fibroblast
Fibroblast, normal
Mast cell line HMC-1
HUVEC IFN,TNF,LPS
HUVEC control
HUVEC shear stress
Hypothalamus
Kidney (T)
Liver (T)
Lung (T)
Skeletal muscle (T)
Oviduct
Pancreas, normal
Pituitary (r)
Pituitary (T)
Placenta
Small intestine (T)
Spleen+liver, fetal
Spleen (T)
Stomach
Rheum. synovium
T + B lymphoblast
Testis (T)
THP-1 control
THP phorbol
THP phorbol LPS
U937, monocytic leuk

number  library    d  s  f  z  r  entry    descriptor                rfstart  start  rfend
2304    U937NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0               773
3240    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        370    773
a2G9    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        371    773
4693    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        470    773
8989    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        327    773
9139    HMC1NOT01  E  H  C  C  T  HUMEF1B  Elongation factor 1-beta  0        375    773






WHAT IS CLAIMED IS:

1. A method of analyzing a specimen containing gene
transcripts, said method comprising the steps of:

(a) producing a library of biological sequences;

(b) generating a set of transcript sequences, where
each of the transcript sequences in said set is indicative
of a different one of the biological sequences of the
library;

(c) processing the transcript sequences in a
programmed computer in which a database of reference
transcript sequences indicative of reference biological
sequences is stored, to generate an identified sequence
value for each of the transcript sequences, where each said
identified sequence value is indicative of a sequence
annotation and a degree of match between one of the
transcript sequences and at least one of the reference
transcript sequences; and

(d) processing each said identified sequence value to
generate final data values indicative of a number of times
each identified sequence value is present in the library.

2. The method of claim 1, wherein step (a) includes
the steps of:

obtaining a mixture of mRNA;

making cDNA copies of the mRNA;

isolating a representative population of clones
transfected with the cDNA and producing therefrom the
library of biological sequences.

3. The method of claim 1, wherein the biological
sequences are cDNA sequences.

4. The method of claim 1, wherein the biological
sequences are RNA sequences.

5. The method of claim 1, wherein the biological
sequences are protein sequences.






6. The method of claim 1, wherein a first value of
said degree of match is indicative of an exact match, and a
second value of said degree of match is indicative of a
non-exact match.

7. A method of comparing two specimens containing
gene transcripts, said method comprising:

(a) analyzing a first specimen according to the
method of claim 1;

(b) producing a second library of biological
sequences;

(c) generating a second set of transcript sequences,
where each of the transcript sequences in said second set
is indicative of a different one of the biological
sequences of the second library;

(d) processing the second set of transcript sequences
in said programmed computer to generate a second set of
identified sequence values known as further identified
sequence values, where each of the further identified
sequence values is indicative of a sequence annotation and
a degree of match between one of the biological sequences
of the second library and at least one of the reference
sequences;

(e) processing each said further identified sequence
value to generate further final data values indicative of a
number of times each further identified sequence value is
present in the second library; and

(f) processing the final data values from the first
specimen and the further identified sequence values from
the second specimen to generate ratios of transcript
sequences, each of said ratio values indicative of
differences in numbers of gene transcripts between the two
specimens.
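
As an illustration of step (f), the following Python fragment is a minimal sketch of one way such ratios could be computed from two sets of per-library counts. It is not taken from the specification: the function name is hypothetical, dividing each count by its library's total before taking the ratio is an assumption, and reporting infinity for a sequence absent from the second library is a design choice of the sketch.

```python
def abundance_ratios(first_counts, second_counts):
    """Ratio of per-library frequencies for every annotation in either library."""
    n1 = sum(first_counts.values())
    n2 = sum(second_counts.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both libraries must contain at least one sequence")
    ratios = {}
    for key in set(first_counts) | set(second_counts):
        f1 = first_counts.get(key, 0) / n1      # frequency in first library
        f2 = second_counts.get(key, 0) / n2     # frequency in second library
        # Assumption: a sequence missing from the second library gets ratio inf.
        ratios[key] = float("inf") if f2 == 0 else f1 / f2
    return ratios


ratios = abundance_ratios({"A": 6, "B": 2}, {"A": 2, "B": 2})
```

A ratio above 1 marks a sequence that is relatively more frequent in the first library, and a ratio below 1 marks one that is relatively more frequent in the second.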

8. A method of quantifying relative abundance of mRNA
in a biological specimen, said method comprising the steps
of:

(a) isolating a population of mRNA transcripts from
the biological specimen;






(b) identifying genes from which the mRNA was 
transcribed by a sequence-specific method; 

(c) determining numbers of mRNA transcripts 
corresponding to each of the genes; and 

(d) using the mRNA transcript numbers to determine 
the relative abundance of mRNA transcripts within the 
population of mRNA transcripts. 
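
Step (d) can be pictured with a short Python sketch. It assumes that relative abundance means each gene's transcript count divided by the total count for the population; the function name and the sample counts are hypothetical.

```python
def relative_abundance(transcript_counts):
    """Fraction of the transcript population contributed by each gene."""
    total = sum(transcript_counts.values())
    if total == 0:
        raise ValueError("no transcripts counted")
    return {gene: n / total for gene, n in transcript_counts.items()}


abundance = relative_abundance({"geneA": 3, "geneB": 1})
```

The returned fractions sum to 1 across the population, so they can be compared between specimens of different sizes.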

9. A diagnostic method which comprises producing a 
gene transcript image, said method comprising the steps of: 

(a) isolating a population of mRNA transcripts from a 
biological specimen; 

(b) identifying genes from which the mRNA was 
transcribed by a sequence-specific method; 

(c) determining numbers of mRNA transcripts 
corresponding to each of the genes; and 

(d) using the mRNA transcript numbers to determine 
the relative abundance of mRNA transcripts within the 
population of mRNA transcripts, where data determining the 
relative abundance values of mRNA transcripts is the gene 
transcript image of the biological specimen. 

10. The method of claim 9, further comprising: 

(e) providing a set of standard normal and diseased 
gene transcript images; and 

(f) comparing the gene transcript image of the
biological specimen with the gene transcript images of step
(e) to identify at least one of the standard gene
transcript images which most closely approximate the gene
transcript image of the biological specimen.

11. The method of claim 9, wherein the biological
specimen is biopsy tissue, sputum, blood or urine.

12. A method of producing a gene transcript image,
said method comprising the steps of:

(a) obtaining a mixture of mRNA; 

(b) making cDNA copies of the mRNA; 




(c) inserting the cDNA into a suitable vector and
using said vector to transfect suitable host strain cells
which are plated out and permitted to grow into clones,
each clone representing a unique mRNA;

(d) isolating a representative population of
recombinant clones;

(e) identifying amplified cDNAs from each clone in
the population by a sequence-specific method which
identifies gene from which the unique mRNA was transcribed;

(f) determining a number of times each gene is
represented within the population of clones as an
indication of relative abundance; and

(g) listing the genes and their relative abundance in
order of abundance, thereby producing the gene transcript
image.

13. The method of claim 12, also including the step
of diagnosing disease by:

repeating steps (a) through (g) on biological
specimens from random sample of normal and diseased humans,
encompassing a variety of diseases, to produce reference
sets of normal and diseased gene transcript images;

obtaining a test specimen from a human, and producing
a test gene transcript image by performing steps (a)
through (g) on said test specimen;

comparing the test gene transcript image with the
reference sets of gene transcript images; and

identifying at least one of the reference gene
transcript images which most closely approximates the test
gene transcript image.
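
The comparison in claim 13 does not name a similarity measure. Purely as an illustration, the Python sketch below treats each gene transcript image as a mapping from gene to relative abundance and picks the reference image at the smallest Euclidean distance; the distance measure, the function name, and the sample images are all assumptions, not part of the claim.

```python
import math


def closest_reference(test_image, reference_images):
    """Return the name of the reference image nearest to `test_image`."""
    def distance(a, b):
        genes = set(a) | set(b)             # genes missing from an image count as 0
        return math.sqrt(sum((a.get(g, 0.0) - b.get(g, 0.0)) ** 2 for g in genes))

    return min(reference_images,
               key=lambda name: distance(test_image, reference_images[name]))


references = {
    "normal":   {"geneA": 0.5, "geneB": 0.5},    # made-up abundances
    "diseased": {"geneA": 0.9, "geneB": 0.1},
}
best = closest_reference({"geneA": 0.8, "geneB": 0.2}, references)
```

Any other measure of closeness could be substituted for `distance` without changing the overall shape of the comparison.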

14. A computer system for analyzing a library of
biological sequences, said system including:

means for receiving a set of transcript sequences,
where each of the transcript sequences is indicative of a
different one of the biological sequences of the library;
and

means for processing the transcript sequences in the
computer system in which a database of reference transcript




sequences indicative of reference biological sequences is
stored, wherein the computer is programmed with software
for generating an identified sequence value for each of the
transcript sequences, where each said identified sequence
value is indicative of a sequence annotation and a degree
of match between a different one of the biological
sequences of the library and at least one of the reference
transcript sequences, and for processing each said
identified sequence value to generate final data values
indicative of a number of times each identified sequence
value is present in the library.

15. The system of claim 14, also including:

library generation means for producing the library of
biological sequences and generating said set of transcript
sequences from said library.

16. The system of claim 15, wherein the library
generation means includes:

means for obtaining a mixture of mRNA;

means for making cDNA copies of the mRNA;

means for inserting the cDNA copies into cells and
permitting the cells to grow into clones;

means for isolating a representative population of the
clones and producing therefrom the library of biological
sequences.






SYBASE Database Structure

Library Preparation

Collaborator: Number, Name, Address, Phone, Fax, Email, Project

Cell Supplier: Number, Name, Address, Phone, Fax

Biological Source: Number, Tissue, Organ, Gender, Age, Species, Race, Pathology, Disease stage, Tissue weight, Source, Lot, PO, Comments

Treatment Link: Treatment name, Culture ID

Culture: Number, Source, Lot, PO, Date, Cell density, Quantity, Protocol, Treatment, Comments

mRNA Prep: Number, Culture, Lot, Date, Lapse, Quantity, Weight, Protocol, RNA yield, mRNA yield, % yield, Modifications, Gel appearance, Comments

cDNA Supplier: Number, Name, Address, Phone, Fax

cDNA Construction: Number, Prep #, Library code, Supplier, Type, mRNA used, Vector, Primer, Directions, Av size, Date ship, Catalog #, Price, Date rec, Cloning sites, Primary size, Background, Unamp titer, Amp titer, Host strain, Genotypes, Actin, Amplification, Comments

Figure 1
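The record layout of Figure 1 can be sketched in code. The following minimal Python example assumes that the "Culture" field of an mRNA Prep record and the "Prep #" field of a cDNA Construction record act as references to the corresponding records; the figure lists the fields but does not state these relations, so they are an assumption here. Only a few fields are modelled, and all sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Culture:
    number: int
    source: str
    treatment: str


@dataclass(frozen=True)
class MRNAPrep:
    number: int
    culture: int      # assumed to reference Culture.number
    protocol: str


@dataclass(frozen=True)
class CDNAConstruction:
    number: int
    prep_number: int  # "Prep #", assumed to reference MRNAPrep.number
    library_code: str
    vector: str


def culture_for_library(code, constructions, preps, cultures):
    """Follow the assumed links: library code -> mRNA prep -> culture."""
    construction = next(c for c in constructions if c.library_code == code)
    prep = next(p for p in preps if p.number == construction.prep_number)
    return next(c for c in cultures if c.number == prep.culture)


# Made-up records, for illustration only.
cultures = [Culture(1, "example supplier", "untreated")]
preps = [MRNAPrep(10, culture=1, protocol="example protocol")]
constructions = [CDNAConstruction(100, prep_number=10,
                                  library_code="LIB1", vector="example vector")]
print(culture_for_library("LIB1", constructions, preps, cultures))
```

Tracing a library code back to its culture in this way is the kind of lookup that such linked preparation records would support.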



Figure 2

Figure 3

Incyte Bioinformatics Process

Figure 4



INTERNATIONAL SEARCH REPORT

International application No.
PCT/US95/01160

A. CLASSIFICATION OF SUBJECT MATTER

IPC(6) : C12Q 1/68; G06F 15/00
US CL : 435/6; 364/413.02
According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED

Minimum documentation searched (classification system followed by classification symbols)
U.S. : 435/6; 364/413.02

Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched

Electronic data base consulted during the international search (name of data base and, where practicable, search terms used)
CAS ONLINE, APS, transcript, transcripts, cdna#, mrna#, frequenc?, distribut?, abundanc?



C. DOCUMENTS CONSIDERED TO BE RELEVANT

Category*    Citation of document, with indication, where appropriate, of the relevant passages    Relevant to claim No.

IntelliGenetics Suite, Release 5.4, Advanced Training Manual,
issued January 1993 by IntelliGenetics, Inc., 700 East El
Camino Real, Mountain View, California 94040, United
States of America, pages (1-6)-(1-19) and (2-9)-(2-14), see
entire document.

Science, Volume 252, issued 21 June 1991, M.D. Adams et 
al, "Complementary DNA sequencing: Expressed sequence 
tags and human genome project", pages 1651-1656, see 
entire document. 



15 and 16 



1-14 



1-16 



[X] Further documents are listed in the continuation of Box C.    [ ] See patent family annex.

* Special categories of cited documents:

"A" document defining the general state of the art which is not considered to be of particular relevance

"E" earlier document published on or after the international filing date

"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or other special reason (as specified)

"O" document referring to an oral disclosure, use, exhibition or other means

"P" document published prior to the international filing date but later than the priority date claimed

"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention

"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step when the document is taken alone

"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art

"&" document member of the same patent family

Date of the actual completion of the international search
27 APRIL 1995

Date of mailing of the international search report
04 MAY 1995

Name and mailing address of the ISA/US
Commissioner of Patents and Trademarks
Box PCT
Washington, D.C. 20231
Facsimile No. (703) 305-3230

Authorized officer
JAMES MARTINELL
Telephone No. (703) 308-0196

Form PCT/ISA/210 (second sheet)(July 1992)*



INTERNATIONAL SEARCH REPORT

International application No.
PCT/US95/01160

C (Continuation). DOCUMENTS CONSIDERED TO BE RELEVANT

Category*    Citation of document, with indication, where appropriate, of the relevant passages    Relevant to claim No.

Y       Nucleic Acids Research, Volume 19, No. 25, issued 1991, E.
        Hara et al, "Subtractive cDNA cloning using oligo(dT)30-latex and
        PCR: isolation of cDNA clones specific to undifferentiated human
        embryonal carcinoma cells", pages 7097-7104, see entire
        document.
        Relevant to claim No.: 1-16

X, Y    Nature Genetics, Volume 2, No. 3, issued November 1992, K.
        Okubo et al, "Large scale cDNA sequencing for analysis of
        quantitative and qualitative aspects of gene expression", pages
        173-179, see narrative text portion of entire document.
        Relevant to claim No.: 1, 3 (category X); 2 and 4-16 (category Y)

Form PCT/ISA/210 (continuation of second sheet)(July 1992)*



Reports



Arflp sequence talowing Ser** oocus ^^Xhn 
the dom^cf Aid ip thai ^iow5 rxxnology with hDE 
(74). To delete the complete STB23 sequence and 
craatethenB23ArLn43rrutation. pot^vnorase chah 
reaction ff»CR) pfTOS (S'-TCGGAAGACCTCAT- 
I C I I GCIO TTTTOTATTCCTC- TGTAGATTG- 
TACTGWiAGTGCAC-3'; and 5'-<3CTACAAACAQC- 
GTCXiACTTGMTGCXXX:GACATCTTCQACTT3T. 
GCGGTArnCACAOCG-3') were taed to aapHy 
the cnA3 saquence d pRS3l6. and the reaction 
pfxxixa was raristormed htoyoast torone-st^ gene 
reptacement (R. Rolhstein, Methods Eraymo/. 194.. 
281 (1991J). To creatB the ax/; t::LBJ2 nuatJon oorv 
t^ned on pi 14, a SiWtb Sd I trag^^rt from 7 
was doned Into pUCi9. and «i intemat 4jO-kb Hpa 
^Mx) I fragment was repts^) wftn a L£U? fragmoiu 
To constnd the sftP3Atajg eB«*® dototion cor- 
. leapondng to .931 amno ackss) carrtad on piS3, e 
LBJZ fragmer« was loed to reptac« the 2.6443 Pm( 
l-E(*136 1 tragmont ol S7E23. wTiich O0CU3 wfthin « 
6l2-W5 WrxJ tt-Be* « penwrtc trBgmen cemed on 
pSP72 (PromegaJ. To aaaa YEpA«RA'. a l.6-*(b 
Bam HI fragment oontaWng MFA /, ^om pKK16 |K, 
Kuc^Tta-. R E Steme. J. T^oner, £MBO JL 8. 3973 
(19B9S,wasfig3ledintotheBamHI fi^teofYEpSSl p. 
E W. A. M. Myeis. T. J. Koemer. A. Tzagolofl. Yeast 
Z 163 (1986gj. 

24. J. Chant and I. Herskowitz, Cell 65, 1203 (1991).

25. B. W. Matthews, Acc. Chem. Res. 21, 333 (1988).

26. K. Ktxfiter. H. G. DoNman. J. Tnofne-; J. Cw Btot 
120. 1203 (1993): R. KoSng and C P. HoOenberB. 
fiMBOa 13, 3261 (1994J; C. Bert<o%w. D. Loayza. 
S. Mk^UBfis. A«. SfaE. Cttt 5. 1 165 09MX 

27. A. Bender and J. R. Prriole, Proc. NstL Acatf. So: 
USA 66, 0976 (19861: J, Chant, K. Corrado, J. R. 
Pringle, t. HersKowta. Cef 65. 1213 (1991); S. 
Powers. E. Gonzates. T. Christensea J. Cubert. D. 
Bfoek. ftwd., p. 1225; K O. Park. J. Chant. 1. Her- 
akowtlz, Warure 365, 269 (1993): J- Chant. Jfends 
Genet 10,328(1994); and J- R PringJe. J. 



ucL PCZ25 b a K5-(> (Stratagene) piasrnd ccntayig 
« 0^ Bam ►*-5st I fragment from p'WL ». Sutwti- 
Wion rruationa Off the proposed active rte ot Ajdlp 
were oaalad wWi the use d pC225 arxJ cile-spoclfc 
nwtagenesis rwoMng appropriate synthetic oigoru- 
cieoDdee \fixtUH6aA. 5'-GTGCTC«iAAAG0GCT- 
GCCAAAOCGGC-3': gxfl-eJIA, 5'-AAGAATCAT- 
GTO0GCACAMGGTG0GO3'; vti ajdl^lD. 5'- 
AAGMTCATGTGATCACAAAGGTGCGC>^1. The 
notations wot oonfrmed by seguaxe maiysis. Af- 
tET rrwagenesis, the 0.4-kb Bam Ht-Msc I fraynot 
Jfwn the muiagenced pC225 ptasrrtos was tWB- 
(er?edintopA)a.J tocreateasetof pRS316p*asrrtcte 
carrying diftarent AXL1 aBeles. pl24 (axfl-HSM), 
pl30 (a)tf7-£7iA). and pi 32 (arfNf7lO). Smtoly. a 
w< of HA-laggod aleloe carried on YEp352 wwB cre- 
ated after raplaoement of the pl5l Ban HMgisc I 
fraoment to generate pi 61 (ax/7-£7)A), pl62 (aitf?. 



32 



N. Davis. T. Favaro. C. da Hoog. S. Wm to 
coniwiis on the manuscript Supported by a 
gram to CB. from the Natural Sciericas vid ^ygn 
neertng Resaaich Comdl of Canada. 
M.N A was from a CaEfomia Tobacco-Atfated Ois- 

22 Ane 1995; abceptad 21 Auguat 1995 



CafBKV. 129. 751 (1995): J. Chant. M. Muchke, E 
MitcheB, L Herskowttz. J. R Pringle. £>ft/.. p. 767. 

28. G. F. Sprague Jr., Methods Enzymol. 194, 77
(1991).

29. Single-letter abbreviations for the amino acid residues
are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F,
Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N,
Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W,
Trp; and Y, Tyr.

30. A W303 lA darivBlnie, SYTSZS (MATa i/B3-f ieuS^ 
1l2trp1'lee3eS-1canl'lOOsst1iifnidZ&--:fUSl-bcZ 
/1KS3A.':RJS1 -HSS). WBsIhe pererit stran fry frw fTXJta^ 
searin SY2625 dematww far ihB rnatiig assays, ae- 
oeted pheromone assays, and the ptJs&<tase e^. 
iments hdUaed tw tolOMfr^ sfrsirtf: Y49 {fteZS-l). 
Y115 (fnte7&:±flJEa. Vl42 ^a/J.-rORAgj. Y173 
(axf7A:Xai?). Y220 (aK/JrLTMS srB23A-:tRA3). r221 
(sta23A.*.inA3). >231 M7A:±aJ? 5<a23A^:La^. 
and r233 (sf8e3A-l£U3. MAU denvatwes of 
SY262S IxAjdad the toOow^ strains: Y199 
(5Y2625 made MATo), Y276 (s(a22-7). Yigs 
(rntolAjXa/?). Y196 (arf7A::i£Ug. and Y197 
(ajtfT.TURAd). The EG123 (U4ra totf uraStrpi cent 
h'Ui genetic bacitgro^ was uaad to aeata a set of 
strains for anatysis of bud sfte selection. EG 123 dft- 
rivatKw- incfuded ttw tolowtng strains: Y17S 
(axn&.*:LaJC}, Y223 (ajdT.7CJn4J), Y234 (rtB23Ar 
iH/a vid Y272 ^1A::L£U2 ste23&r:LBJ?l, 
MATo dertvatives of EG123 hdudad (he tolowing 
strains: Y2i4 (EG123 made M4fo) and Y293 
{fixntrLEUZl AS strains were generBied by rneans 
ctf stamard genetic or mdeoiar methods frivoMng 
the appropriata constructs (23). fr> particular, the sjdl 
8te23 double mutant strm were cTBBtad by cross- 
ing of the appropriate AMTa sta23 and MATo axtl 
mutants, fbOowed by spvutatlon d the resiAant dp- 
bid and isolation of the double nwtani from nonpe- 
rertta) 'type tetrads. Gene dteruptlons were oon- 
firmed with either PGR or Soulhem (DNAJ ana^. 

31. p129 b a YEp352 p. E HI. A M Myers, T. J. Ko- 
em», A. Tiagctofl, Yeast 2. 163 (19863) ptasmid con- 
taining a S.54<bS8f I fragrm of pAXLf.plSI was 
derived from pi 29 by irwfftion of i inker at the Bgl I 
sile within AXL 7 , wrtcJited to frvfreme inaorTion of 
the hernaggiitlnin (KA) eiAope |DQnrPttA«nrA) (2^ 
between vnfrK) adds 854 9x1655 of the AJO. 7 prod- 



Quantitative Monitoring of Gene Expression
Patterns with a Complementary DNA Microarray

Mark Schena,* Dari Shalon,*† Ronald W. Davis,
Patrick O. Brown‡

A high-capacity system was developed to monitor the expression of many genes in
parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on
glass were used for quantitative expression measurements of the corresponding genes.
Because of the small format and high density of the arrays, hybridization volumes of 2
microliters could be used that enabled detection of rare transcripts in probe mixtures
derived from 2 micrograms of total cellular messenger RNA. Differential expression
measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color
fluorescence hybridization.



The temporal, developmental, topographical,
histological, and physiological patterns
in which a gene is expressed provide clues to
its biological role. The large and expanding
database of complementary DNA (cDNA)
sequences from many organisms (1) presents
the opportunity of defining these patterns at
the level of the whole genome.

For these studies, we used the small flowering
plant Arabidopsis thaliana as a model
organism. Arabidopsis possesses many advantages
for gene expression analysis, including
the fact that it has the smallest
genome of any higher eukaryote examined
to date (2). Forty-five cloned Arabidopsis
cDNAs (Table 1), including 14 complete
sequences and 31 expressed sequence tags
(ESTs), were used as gene-specific targets.
We obtained the ESTs by selecting cDNA
clones at random from an Arabidopsis
cDNA library. Sequence analysis revealed
that 28 of the 31 ESTs matched sequences

M. Schena and R. W. Davis, Department of Biochemistry,
Beckman Center, Stanford University Medical Center,
Stanford, CA 94305, USA.

D. Shalon and P. O. Brown, Department of Biochemistry
and Howard Hughes Medical Institute, Beckman Center,
Stanford University Medical Center, Stanford, CA 94305,
USA.

*These authors contributed equally to this work.
†Present address: Synteni, Palo Alto, CA 94303, USA.
‡To whom correspondence should be addressed. E-mail:
pbrown@cmgm.stanford.edu

SCIENCE • VOL. 270 • 20 OCTOBER 1995



in the database (Table 1). Three additional
cDNAs from other organisms served as controls
in the experiments.

The 48 cDNAs, averaging ~1.0 kb,
were amplified with the polymerase chain
reaction (PCR) and deposited into individual
wells of a 96-well microtiter plate.
Each sample was duplicated in two adjacent
wells to allow the reproducibility of
the arraying and hybridization process to
be tested. Samples from the microtiter
plate were printed onto glass microscope
slides in an area measuring 3.5 mm by 5.5
mm with the use of a high-speed arraying
machine (3). The arrays were processed by
chemical and heat treatment to attach the
DNA sequences to the glass surface and
denature them (3). Three arrays, printed
in a single lot, were used for the experiments
here. A single microtiter plate of
PCR products provides sufficient material
to print at least 500 arrays.
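The plate and array geometry described above can be checked with a short Python sketch. It assumes that the printed array mirrors the 8 by 12 layout of the 96-well plate, which is consistent with the position labels in Table 1 but is not stated outright, and it takes the 500 µm spot spacing from note 3. The function names are illustrative only.

```python
from string import ascii_lowercase

ROWS, COLS = 8, 12   # standard 96-well microtiter plate
SPACING_MM = 0.5     # 500 µm spot spacing (note 3)


def duplicate_pairs(rows=ROWS, cols=COLS):
    """Return position labels such as 'a1,2' for samples in two adjacent wells."""
    labels = []
    for r in range(rows):
        for c in range(1, cols, 2):
            labels.append(f"{ascii_lowercase[r]}{c},{c + 1}")
    return labels


def grid_extent_mm(rows=ROWS, cols=COLS, spacing=SPACING_MM):
    """Centre-to-centre extent of a rows x cols grid of spots, in millimetres."""
    return ((rows - 1) * spacing, (cols - 1) * spacing)


pairs = duplicate_pairs()
print(len(pairs), pairs[0], pairs[-1])  # 48 a1,2 h11,12
print(grid_extent_mm())                 # (3.5, 5.5)
```

Under these assumptions, 48 duplicated samples fill the plate exactly, and a 500 µm pitch reproduces the stated printed area of 3.5 mm by 5.5 mm.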

Fluorescent probes were prepared from
total Arabidopsis mRNA (4) by a single
round of reverse transcription (5). The Arabidopsis
mRNA was supplemented with human
acetylcholine receptor (AChR) mRNA
at a dilution of 1:10,000 (w/w) before cDNA
synthesis, to provide an internal standard for
calibration (5). The resulting fluorescently
labeled cDNA mixture was hybridized to an
array at high stringency (6) and scanned
with a laser (3). A high-sensitivity scan gave
signals that saturated the detector at nearly
all of the Arabidopsis target sites (Fig. 1A).
Calibration relative to the AChR mRNA
standard (Fig. 1A) established a sensitivity
limit of ~1:50,000. No detectable hybridization
was observed to either the rat glucocorticoid
receptor (Fig. 1A) or the yeast TRP4
(Fig. 1A) targets even at the highest scanning
sensitivity. A moderate-sensitivity scan
of the same array allowed linear detection of
the more abundant transcripts (Fig. 1B).
Quantitation of both scans revealed a range
of expression levels spanning three orders of
magnitude for the 45 genes tested (Table 2).
RNA blots (7) for several genes (Fig. 2)
corroborated the expression levels measured
with the microarray to within a factor of 5
(Table 2).

Differential gene expression was investigated
with a simultaneous, two-color hybridization
scheme, which served to minimize
experimental variation inherent in the
comparison of independent hybridizations.
Fluorescent probes were prepared from two
mRNA sources with the use of reverse transcriptase
in the presence of fluorescein- and
lissamine-labeled nucleotide analogs, respectively
(5). The two probes were then
mixed together in equal proportions, hybridized
to a single array, and scanned separately
for fluorescein and lissamine emission
after independent excitation of the two
fluorophores (3).

[Fig. 1 panels: A, High sensitivity; B, Moderate sensitivity; C, Wild type; D, HAT4 transgenic; E, Root tissue; F, Leaf tissue. Columns 1 to 12; color scale: expression level (w/w).]

Fig. 1. Gene expression monitored with the use of cDNA
microarrays. [...] with the use of known concentrations of
human AChR mRNA [...] letters on the axes mark the position
of each cDNA. [...] with fluorescein-labeled cDNA derived
from wild-type plants. (B) Same array as in (A) but scanned at
moderate sensitivity. (C and D) A single array was probed with
a 1:1 mixture of fluorescein-labeled cDNA from wild-type plants
and lissamine-labeled cDNA from HAT4-transgenic plants. The
single array was then scanned successively to detect the
fluorescein fluorescence corresponding to mRNA from wild-type
plants (C) and the lissamine fluorescence corresponding to mRNA
from HAT4-transgenic plants (D). (E and F) A single array was
probed with a 1:1 mixture of fluorescein-labeled cDNA from root
tissue and lissamine-labeled cDNA from leaf tissue. The single
array was then scanned successively to detect the fluorescein
fluorescence corresponding to mRNAs expressed in roots (E) and
the lissamine fluorescence corresponding to mRNAs expressed in
leaves (F).

To test whether overexpression of a single
gene could be detected in a pool of total
Arabidopsis mRNA, we used a microarray to
analyze a transgenic line overexpressing the
single transcription factor HAT4 (8). Fluorescent
probes representing mRNA from
wild-type and HAT4-transgenic plants were
labeled with fluorescein and lissamine, respectively;
the two probes were then mixed
and hybridized to a single array. An intense
hybridization signal was observed at the
position of the HAT4 cDNA in the lissamine-specific
scan (Fig. 1D), but not in the
fluorescein-specific scan of the same array
(Fig. 1C). Calibration with AChR mRNA
added to the fluorescein and lissamine
cDNA synthesis reactions at dilutions of
1:10,000 (Fig. 1C) and 1:100 (Fig. 1D),
respectively, revealed a 50-fold elevation of
HAT4 mRNA in the transgenic line relative
to its abundance in wild-type plants
(Table 2). This magnitude of HAT4 overexpression
matched that inferred from the
Northern (RNA) analysis within a factor of
2 (Fig. 2 and Table 2). Expression of all the
other genes monitored on the array differed
by less than a factor of 5 between HAT4-transgenic
and wild-type plants (Fig. 1, C
and D, and Table 2). Hybridization of fluorescein-labeled
glucocorticoid receptor
cDNA (Fig. 1C) and lissamine-labeled
TRP4 cDNA (Fig. 1D) verified the presence
of the negative control targets and the
lack of optical cross talk between the two
fluorophores.

[Fig. 2 labels: wild type; CAB1; HAT4; ROC1; mRNA (µg): 1.0, 0.1, 0.01; human AChR; mRNA (ng): 20, 2.0, 0.2.]

Fig. 2. Gene expression monitored with RNA
(Northern) blot analysis. Designated amounts of
mRNA from wild-type and HAT4-transgenic
plants were spotted onto nylon membranes and
probed with the cDNAs indicated. Purified human
AChR mRNA was used for calibration.
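The arithmetic behind the reported elevation of HAT4 mRNA can be sketched in Python. This is a minimal illustration, not the authors' analysis software. It assumes that calibrated expression levels are written as w/w ratios of the form 1:N, and it reads the Table 2 microarray entries for HAT4 as 1:8300 (wild type) and 1:150 (transgenic). The function names are hypothetical.

```python
from fractions import Fraction


def parse_level(text: str) -> Fraction:
    """Parse a w/w expression level written as '1:8300' into a fraction."""
    numerator, denominator = text.replace(",", "").split(":")
    return Fraction(int(numerator), int(denominator))


def fold_change(sample: str, reference: str) -> float:
    """Ratio of two calibrated expression levels (sample over reference)."""
    return float(parse_level(sample) / parse_level(reference))


# HAT4 microarray levels as read from Table 2 (see the assumption above).
hat4_wild_type = "1:8300"
hat4_transgenic = "1:150"
print(round(fold_change(hat4_transgenic, hat4_wild_type), 1))  # 55.3
```

The result, about 55-fold, is in line with the roughly 50-fold elevation stated in the text.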

To explore a more complex alteration in
expression patterns, we performed a second
two-color hybridization experiment with
fluorescein- and lissamine-labeled probes
prepared from root and leaf mRNA, respectively.
The scanning sensitivities for the
two fluorophores were normalized by
matching the signals resulting from AChR
mRNA, which was added to both cDNA
synthesis reactions at a dilution of 1:1000
(Fig. 1, E and F). A comparison of the scans
revealed widespread differences in gene expression
between root and leaf tissue (Fig. 1,
E and F). The mRNA from the light-regulated
CAB1 gene was ~500-fold more abundant
in leaf (Fig. 1F) than in root tissue
(Fig. 1E). The expression of 26 other genes
differed between root and leaf tissue by
more than a factor of 5 (Fig. 1, E and F).
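The normalization and comparison just described can be sketched as follows. The sketch assumes a linear detector response and treats each scan as a mapping from gene name to signal in arbitrary units. The gene names and readings are made up for illustration and are not data from the report, and the function names are hypothetical.

```python
def normalize(signals, standard="AChR"):
    """Scale one channel so that its spiked-in standard reads 1.0."""
    reference = signals[standard]
    return {gene: value / reference for gene, value in signals.items()}


def differential(root, leaf, threshold=5.0, standard="AChR"):
    """Return genes whose normalized leaf/root ratio differs by more than threshold."""
    root_n = normalize(root, standard)
    leaf_n = normalize(leaf, standard)
    hits = {}
    for gene in root_n:
        if gene == standard:
            continue
        ratio = leaf_n[gene] / root_n[gene]
        if ratio > threshold or ratio < 1.0 / threshold:
            hits[gene] = ratio
    return hits


# Made-up scanner readings in arbitrary units, for illustration only.
root = {"AChR": 200.0, "geneA": 50.0, "geneB": 300.0}
leaf = {"AChR": 100.0, "geneA": 2500.0, "geneB": 150.0}
print(differential(root, leaf))  # {'geneA': 100.0}
```

Dividing each channel by its own standard signal plays the role of matching the AChR signals between the two fluorophores, so the two scans need not be acquired at the same sensitivity.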

The HAT4-transgenic line we examined
has elongated hypocotyls, early flowering,
poor germination, and altered pigmentation
(8). Although changes in expression were



Table 1. Sequences contained on the cDNA microarray. Shown is the position, the known or putative
function, [...] NADH, reduced form of nicotinamide adenine dinucleotide; ATPase, adenosine
triphosphatase; GTP, guanosine triphosphate.

Position   cDNA     Function                          Accession number

a1,2       AChR     Human AChR                        *
a3,4       EST3     Actin                             H36236
a5,6       EST6     NADH dehydrogenase                Z27010
a7,8       AAC1     Actin 1                           M20016
a9,10      EST12    Unknown                           U36594†
a11,12     EST13    Actin                             T45783
b1,2       CAB1     Chlorophyll a/b binding           M85150
b3,4       EST17    Phosphoglycerate kinase           T44490
b5,6       GA4      Gibberellic acid biosynthesis     L37126
b7,8       EST19    Unknown                           U36595†
b9,10      GBF-1    G-box binding factor 1            X63894
b11,12     EST23    Elongation factor                 X52256
c1,2       EST29    Aldolase                          T04477
c3,4       GBF-2    G-box binding factor 2            X63895
c5,6       EST34    Chloroplast protease              R87034
c7,8       EST35    Unknown                           T14152
c9,10      EST41    Catalase                          T22720
c11,12     rGR      Rat glucocorticoid receptor       M14053
d1,2       EST42    Unknown                           U36596†
d3,4       EST45    ATPase                            J04185
d5,6       HAT1     Homeobox-leucine zipper 1         U09332
d7,8       EST46    Light harvesting complex          T04063
d9,10      EST49    Unknown                           T76267
d11,12     HAT2     Homeobox-leucine zipper 2         U09335
e1,2       HAT4     Homeobox-leucine zipper 4         M90394
e3,4       EST50    Phosphoribulokinase               T04344
e5,6       HAT5     Homeobox-leucine zipper 5         M90416
e7,8       EST51    Unknown                           Z33675
e9,10      HAT22    Homeobox-leucine zipper 22        U09336
e11,12     EST52    Oxygen evolving                   T21749
f1,2       EST59    Unknown                           Z34607
f3,4       KNAT1    Knotted-like homeobox 1           U14174
f5,6       EST60    RuBisCO small subunit             X14554
f7,8       EST69    Translation elongation factor     T42799
f9,10      PPH1     Protein phosphatase 1             U34803
f11,12     EST70    Unknown                           T44621
g1,2       EST75    Chloroplast protease              T43698
g3,4       EST78    Unknown                           R65481
g5,6       ROC1     Cyclophilin                       L14844
g7,8       EST82    GTP binding                       X59152
g9,10      EST83    Unknown                           Z33795
g11,12     EST84    Unknown                           T45276
h1,2       EST91    Unknown                           T13832
h3,4       EST96    Unknown                           R54816
h5,6       SAR1     Synaptobrevin                     M90418
h7,8       EST100   Light harvesting complex          Z18205
h9,10      EST103   Light harvesting complex          X03909
h11,12     TRP4     Yeast tryptophan biosynthesis     X04273

*Proprietary sequence of Stratagene (La Jolla, California). †No match in the database; novel EST.



observed for HAT4, large changes in expression
were not observed for any of the
other 44 genes we examined. This was
somewhat surprising, particularly because
comparative analysis of leaf and root tissue
identified 27 differentially expressed genes.
Analysis of an expanded set of genes may be
required to identify genes whose expression
changes upon HAT4 overexpression; alternatively,
a comparison of mRNA populations
from specific tissues of wild-type and
HAT4-transgenic plants may allow identification
of downstream genes.

At the current density of robotic printing,
it is feasible to scale the fabrication process
to produce arrays containing [...]
cDNA targets. At this density, a single array
would be sufficient to provide gene-specific
targets encompassing nearly the entire repertoire
of expressed genes in the Arabidopsis
genome (2). The availability of 20,274 ESTs
from Arabidopsis (1, 9) would provide a rich
source of templates for such studies.

The estimated 100,000 genes in the human
genome (10) exceeds the number of
Arabidopsis genes by a factor of 5 (2). This
modest increase in complexity suggests that
similar cDNA microarrays, prepared from
the rapidly growing repertoire of human
ESTs (1), could be used to determine the
expression patterns of tens of thousands of
human genes in diverse cell types. Coupling
an amplification strategy to the reverse
transcription reaction (11) could make it
feasible to monitor expression even in
minute tissue samples. A wide variety of
acute and chronic physiological and pathological
conditions might lead to characteristic
changes in the patterns of gene expression
in peripheral blood cells or other easily
sampled tissues. In concert with cDNA microarrays
for monitoring complex expression
patterns, these tissues might therefore
serve as sensitive in vivo sensors for clinical
diagnosis. Microarrays of cDNAs could thus
provide a useful link between human gene
sequences and clinical medicine.



Table 2. Gene expression monitoring by microarray
and RNA blot analyses; tg, HAT4-transgenic.
See Table 1 for additional gene information. Expression
levels (w/w) were calibrated with the use
of known amounts of human AChR mRNA. Values
for the microarray were determined from microarray
scans (Fig. 1); values for the RNA blot were
determined from RNA blots (Fig. 2).

Gene         Expression level (w/w)
             Microarray     RNA blot

CAB1         1:48           1:83
CAB1 (tg)    1:120          1:150
HAT4         1:8300         1:6300
HAT4 (tg)    1:150          1:210
ROC1         1:1200         1:1800
ROC1 (tg)    1560           1:1300



REFERENCES AND NOTES

1. The current EST database (dbEST release 091495)
from the National Center for Biotechnology Information
(Bethesda, MD) contains a total of 322,225 entries,
including 2[?]5,645 from the human genome
and [...] from Arabidopsis. Access is available via
the World Wide Web (http://www.ncbi.nlm.nih.gov).

2. E. M. Meyerowitz and R. E. Pruitt, Science 229, 1214
(1985); R. E. Pruitt and E. M. Meyerowitz, J. Mol. Biol.
187, 169 (1986); I. Hwang et al., Plant J. 1, 367 (1991);
P. Jarvis et al., Plant Mol. Biol. 24, 685 (1994); L. Le
Guen et al., Mol. Gen. Genet. 245, 390 (1994).

3. D. Shalon, thesis, Stanford University (1995); [...]
and P. O. Brown, in preparation. Microarrays were
fabricated on poly-L-lysine-coated microscope
slides (Sigma) with a custom-built arraying machine
fitted with one printing tip. The tip loaded 1 µl of PCR
product (0.5 mg/ml) from microtiter plates
and deposited ~0.005 µl per slide on 40 slides at a
spacing of 500 µm. The printed slides were rehydrated
for 2 hours in a humid chamber, snap-dried at
100°C for 1 min, rinsed in 0.1% SDS, and treated
with 0.05% succinic anhydride prepared in buffer
consisting of 50% 1-methyl-2-pyrrolidinone and
50% boric acid. The cDNA on the slides was denatured
in distilled water for 2 min at 90°C immediately
before use. Microarrays were scanned with a laser
fluorescent scanner that contained a computer-controlled
XY stage and a microscope objective. A mixed
gas, multiline laser allowed sequential excitation of
the two fluorophores. Emitted light was split according
to wavelength and detected with two photomultiplier
tubes. Signals were read into a PC with the use
of a 12-bit analog-to-digital board. Additional details
of microarray fabrication and use may be obtained by
means of e-mail (pbrown@cmgm.stanford.edu).

4. F. M. Ausubel et al., Eds., Current Protocols in Molecular
Biology (Greene & Wiley Interscience, New
York, 1994), pp. 4.3.1-4.3.4.

5. Polyadenylated [poly(A)+] mRNA was prepared from
total RNA with the use of Oligotex-dT resin (Qiagen).
Reverse transcription (RT) reactions were carried out
with a StrataScript RT-PCR kit (Stratagene) modified
as follows: 50-µl reactions contained 0.1 µg/µl of
Arabidopsis mRNA, 0.1 ng/µl of human AChR
mRNA, 0.05 µg/µl of oligo(dT) (21-mer), 1× first
strand buffer, 0.03 U/µl of ribonuclease block, 500
µM deoxyadenosine triphosphate (dATP), 500 µM
deoxyguanosine triphosphate, 500 µM dTTP, 40
µM deoxycytosine triphosphate (dCTP), 40 µM
fluorescein-12-dCTP (or lissamine-5-dCTP), and 0.03
U/µl of StrataScript reverse transcriptase. Reactions
were incubated for 60 min at 37°C, precipitated with
ethanol, and resuspended in 10 µl of TE (10 mM tris-HCl
and 1 mM EDTA, pH 8.0). Samples were then
heated for 3 min at 94°C and chilled on ice. The RNA
was degraded by adding 0.25 µl of 10 N NaOH
followed by a 10-min incubation at 37°C. The samples
were neutralized by addition of 2.5 µl of 1 M
tris-Cl (pH 8.0) and 0.25 µl of 10 N HCl and precipitated
with ethanol. Pellets were washed with 70%
ethanol, dried to completion in a speedvac, resuspended
in 10 µl of H2O, and reduced to 3.0 µl in a
speedvac. Fluorescent nucleotide analogs were obtained
from New England Nuclear (DuPont).

6. Hybridization reactions contained 1.0 µl of fluorescent
cDNA synthesis product (5) and 1.0 µl of hybridization
buffer [10× saline sodium citrate (SSC) and 0.2%
SDS]. The 2.0-µl probe mixtures were aliquoted onto
the microarray surface and covered with cover slips
(12 mm round). Arrays were transferred to a hybridization
chamber (3) and incubated for 18 hours at
[...]°C. Arrays were washed for 5 min at room temperature
(25°C) in low-stringency wash buffer (1× SSC
and 0.1% SDS), then for 10 min at room temperature
in high-stringency wash buffer (0.1× SSC and 0.1%
SDS). Arrays were scanned in 0.1× SSC with the use
of a fluorescence laser-scanning device (3).

7. Samples of poly(A)+ mRNA (4, 5) were spotted onto
nylon membranes (Nytran) and crosslinked with ultraviolet
light with the use of a Stratalinker 1800
(Stratagene). Probes were prepared by random
priming with the use of a Prime-It II kit (Stratagene) in
the presence of [32P]dATP. Hybridizations were carried
out according to the instructions of the manufacturer.
Quantitation was performed on a PhosphorImager
(Molecular Dynamics).

8. M. Schena and R. W. Davis, Proc. Natl. Acad. Sci.
U.S.A. 89, 3894 (1992); M. Schena, A. M. Lloyd, R.
W. Davis, Genes Dev. 7, 367 (1993); M. Schena and
R. W. Davis, Proc. Natl. Acad. Sci. U.S.A. 91, 8393
(1994).

9. H. Höfte et al., Plant J. 4, 1051 (1993); T. Newman et
al., Plant Physiol. 106, 1241 (1994).

10. N. E. Morton, Proc. Natl. Acad. Sci. U.S.A. 88, 7474
(1991); E. D. Green and R. H. Waterston, J. Am.
Med. Assoc. 266, 1966 (1991); C. Bellanné-Chantelot
et al., Cell 70, 1059 (1992); D. R. Cox et al., Science
265, 2031 (1994).

11. E. S. Kawasaki et al., Proc. Natl. Acad. Sci. U.S.A.
85, 5698 (1988).

12. The laser fluorescent scanner was designed and fabricated
in collaboration with S. Smith of Stanford University.
Scanner and analysis software was developed by
R. X. Xia. The succinic anhydride reaction was suggested
by J. Mulligan and J. Van Ness of Darwin Molecular
Corporation. Thanks to S. Theologis, C. Somerville, K.
Yamamoto, and members of the laboratories of R.W.D.
and P.O.B. for critical comments. Supported by the
Howard Hughes Medical Institute and by grants from
NIH [R21HG00450] (P.O.B.) and [R37AG00198]
(R.W.D.) and from NSF (MCB9106011) (R.W.D.) and
by an NSF graduate fellowship (D.S.). P.O.B. is an
assistant investigator of the Howard Hughes Medical
Institute.

11 August 1995; accepted 22 September 1995



Gene Therapy in Peripheral Blood 
Lymphocytes and Bone Marrow for 
ADA Immunodeficient Patients 

Claudio Bordignon,* Luigi D. Notarangelo, Nadia Nobili,
Giuliana Ferrari, Giulia Casorati, Paola Panina, Evelina Mazzolari,
Daniela Maggioni, Claudia Rossi, Paolo Servida,
Alberto G. Ugazio, Fulvio Mavilio

Adenosine deaminase (ADA) deficiency results in severe combined immunodeficiency,
the first genetic disorder treated by gene therapy. Two different retroviral vectors were
used to transfer ex vivo the human ADA minigene into bone marrow cells and peripheral
blood lymphocytes from two patients undergoing exogenous enzyme replacement therapy.
After 2 years of treatment, long-term survival of T and B lymphocytes, marrow cells,
and granulocytes expressing the transferred ADA gene was demonstrated and resulted
in normalization of the immune repertoire and restoration of cellular and humoral immunity.
After discontinuation of treatment, T lymphocytes, derived from transduced peripheral
blood lymphocytes, were progressively replaced by marrow-derived T cells in both patients.
These results indicate successful gene transfer into long-lasting progenitor cells,
producing a functional multilineage progeny.



470 



Severe combined immunodeficiency associated
with inherited deficiency of ADA
(1) is usually fatal unless affected children
are kept in protective isolation or the immune
system is reconstituted by bone marrow
transplantation from a human leukocyte
antigen (HLA)-identical sibling donor
(2). This is the therapy of choice, although
it is available only for a minority of patients.
In recent years, other forms of therapy have
been developed, including transplants from
haploidentical donors (3, 4), exogenous enzyme
replacement (5), and somatic-cell
gene therapy (6-9).

We previously reported a preclinical mod- 
el in which ADA gene transfer and expression 

C. Bordignon, N. Nobili, G. Ferrari, D. Maggioni, C. Rossi, P. Servida, F. Mavilio,
Telethon Gene Therapy Program for Genetic Diseases, DIBIT, Istituto Scientifico
H. S. Raffaele, Milan, Italy.

L. D. Notarangelo, E. Mazzolari, A. G. Ugazio, Department of Pediatrics, University
of Brescia Medical School, Brescia, Italy.

G. Casorati, Unità di Immunochimica, DIBIT, Istituto Scientifico H. S. Raffaele,
Milan, Italy.

P. Panina, Roche Milano Ricerche, Milan, Italy.

* To whom correspondence should be addressed.
SCIENCE • VOL 270 • 20 OCTOBER 1995 



successfully restored immune functions in human ADA-deficient (ADA⁻) peripheral
blood lymphocytes (PBLs) in immunodeficient mice in vivo (10, 11). On the basis of
these preclinical results, the clinical application of gene therapy for the treatment of
ADA⁻ SCID (severe combined immunodeficiency disease) patients who previously failed
exogenous enzyme replacement therapy was approved by our Institutional Ethical
Committees and by the Italian National Committee for Bioethics (12). In addition to
evaluating the safety and efficacy of the gene therapy procedure, the aim of the study
was to define the relative role of PBLs and hematopoietic stem cells in the long-term
reconstitution of immune functions after retroviral vector-mediated ADA gene transfer.
For this purpose, two structurally identical vectors expressing the human ADA
complementary DNA (cDNA), distinguishable by the presence of alternative restriction
sites in a nonfunctional region of the viral long-terminal repeat (LTR), were used to
transduce PBLs and bone marrow (BM) cells independently. This procedure allowed
identification of the origin of



Reference 3 of 20 

with Response dated 05/04/04 

In USSN: 09/857,826 



REPORTS



Axnp sequence loiCMfng Ser^ and ooon within 
thedomaholAidlptfiat^iows honT ologywithhDE 
(74). To delete Ihe canvM STB23 aequenoa and 
craatethasts23A::ijn43rTutation. potymerasGCtuin 
reaction (PGR) primers (5'-TCGGAAQaCCTCAT- 
TCTTTCTCATTTT&kTATTCCTO- TGTAC5ATTB- 
TACT&«iAGTGCAC-3': wxJ 5'-GCTACAMCAQC- 

gtcgacttgaatgcoccoi^catcttcqactgt- 

GCX5GTATT7CACACCG-3') were woJ to ampWy 
the URA3 sequence of pRS3l6. and Ihe reaction 
product was tanslormedlrto yeast for one-step gene 
lepbcemei* (R. Ftothsteia A^MTtods Emymol, 104. 
281 (19911}. Tocreatethetftfr A:.-L£U?rajtBt)an con- 
tained on pi 14. a SIHtb Sd I trsgnnert from pA)! y 
was doned Into pUCi9. axi tfi Intemaf 4iH(t} Hpa 
Mho I tragmem was replaoed wtth a LH/2 fragmoit 
To constnxa the ife?.^/i-iajg affeto (a deletion cor^ 
n&pondino to 931 amiio ecads) carrted on p1S3. e 
LEU2 fragmert was ised to replace the 2.8-M> Ptnt 
I-Eci136 I fragment d S7E23. which ooais within « 
&24(b hfirv) n-B^ B gen"*: fragment carried on 
pSP72 (Phxnega). To create YEpMfAl, a 1.6-kb 
Bam HI tragmert containing MFA1, *cm pKKi6 |K, 
Kucrter. R E Steme. J. Ttoner» fiWeO JL 8, 3973 
a989S,wasB9atedintotheBamHlsitecrf\Ep351 p. 
E. Ha. A. M. Myers. T.J. Koemer. A. Tzagotaff. Vtetf 
2.163(19869). 

24. J. Chant and I. Herskowitz, Cell 65, 1203 (1991).

25. B. W. Matthews, Acc. Chem. Res. 21, 333 (1988).

26. K. Kuchler, H. G. Dohlman, J. Thorner, J. Cell Biol.
120, 1203 (1993); R. Kölling and C. P. Hollenberg,
EMBO J. 13, 3261 (1994); C. Berkower, D. Loayza,
S. Michaelis, Mol. Biol. Cell 5, 1185 (1994).

27. A. Bender and J. R. Pringle, Proc. Natl. Acad. Sci.
U.S.A. 86, 9976 (1989); J. Chant, K. Corrado, J. R.
Pringle, I. Herskowitz, Cell 65, 1213 (1991); S.
Powers, E. Gonzales, T. Christensen, J. Cubert, D.
Broek, ibid., p. 1225; H. O. Park, J. Chant, I. Herskowitz,
Nature 365, 269 (1993); J. Chant, Trends
Genet. 10, 328 (1994); ___ and J. R. Pringle, J.



ucL PC225 Is a KS-K (Stratagenei) ptasrrid contv^ig 
a 034* Bam H^Sst I fragmert trom pA>a.l. Substi- 
tuiion rruationa of the proposed actwB site ol A)dlp 
were created wfth the use of pC225 ffvl ^specific 
rrwtaganesis rwoMng appropriate synthetic oigonu- 
cteotidea to«fI-W6BA. 5'-GTC5CTO^CAAAGCGCT- 
GCX:aAACCQGC.3': axil'BTiA, S'-AAGAATCAT- 
GTGCGCA CAAAG GTGCGCV3'; vtS wd1-€7W. 5'- 
AAGMTCATGTGATCACAAAGCSTG0GC>3'). The 
nxitaUxv were confmed by sequence viaVsis. 
ter rrutagenests. the 0.4-lcb Bam H)-Msc I tragmtfH 
from the mjtagenized pC225 ptasrrtds was iwis- 
ferrod into pAJO. J to oeate a set of pRS31 6 ptaartds 
canyrg diterent AXL1 aBeles. pi 24 (a«^7-HSS4). 
piX ifijSi-€7}A\ and pi 32 (ajrf^-E/lC). Smtoty. a 
set of KA-tagged aletas carried on Y^j352 w»B oe- 
aied after reptacement of the pl5l Bot Hft-Msc I 
hagment. to generate pi 61 (a)rff-£7J/^. pl62 (ax/?- 



32 



T. Favero, C. de Hoog. and S. Km tt^ 
comments on the manuscript. Supported by a
grant to C.B. from the Natural Sciences and Engineering
Research Council of Canada. Support for
M.N.A. was from a California Tobacco-Related Disease
Research Program postdoctoral fellowship
(4FT-0083).

22 June 1995; accepted 21 August 1995



Cell Biol. 129, 751 (1995); J. Chant, M. Mischke, E.
Mitchell, I. Herskowitz, J. R. Pringle, ibid., p. 767.

28. G. F. Sprague Jr., Methods Enzymol. 194, 77
(1991).

29. Single-letter abbreviations for the amino acid residues
are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F,
Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N,
Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W,
Trp; and Y, Tyr.

30. A yV303 1A derivBtwe, SY2825 {fMTm tfB3-f Jau2-a 

/its3d.".atSl-NfS3). wBStheptfant strair>tor therrutant 
search. SY2e25 derivBtiMs tor the mating assays, ae- 
oeted pheromone assays, and the pUte-chase «pg- 
iments ixluded tie tolowlr^ sfrsins: Y49 0ta2?-7}, 
yn5 ifl)talt:XBJZi. Y142 >*/7.-:lJ5Fl43). Y173 
(aKf7A:i3JC). Y220 (ax/7iun43 steS^trURASi, r221 
(sts23d:X;AA3). YZ31 M7A::l£U? 5f823A:m^. 
and r233 ^stsBZhrXBS^ MATa 6Bm3ANm of 
SY2625 IrvAjdsd the IbOowing strains: Y199 
(SY2625 made MATa), Y278 (r(s22-7). Vl95 
(mfef&jitfl/?). Y196 tBJdlA:.-i£Ua and V197 
(axn;.*Lm3). The EG 1 23 (MATa to(^ ura3 trplcani 
h's4} genetic badcground was used to aeets a sat of 
strains for analysts o( bud sllfl salBCtion. EG1 23 dft- 
rtvatiwas hduded the folowing strains: Y175 
(ajrfIA.7LfU2), Y223 (a)tff:.-0«43), Y234 trfe23Ai- 
LEUZl and Y272 fpxt1A::LBJ2 ste23&::L£m, 
MATa derivatives of E6123 incfcxled the foiovmo 
strains: y2l4 (EG123 made MATa) and Y293 
ifiJdl^-rLBVSj, AS strains were generated by means 
of standsrd genetic or rrufectiar methods Involving 
the appropri a te constructs (23). In particular, the axfT 
sle23 double mutant strains were creeled by cross- 
ing of the appropriate AMTa sfa23 and M^Ta wdl 
mutants, foliowed by sporutatlon cf the resuftant dQ>* 
loid and isolation of the double motant from r^npe- 
rental cS-type tetrads. Gene dteruptlons were oorv 
firmed with either PGR or Southern (DMA) analysis. 
31. p129 is a YEp352 (J. E Hi, A. M. Myers. T. J. Ko- 
emer, A. Tzagolofl, YeasrS. I63(1966))plasmidoon- 
tainirtg a 5.5-kb Stf I Iragrr^ of pAXLf. pISI was 
derived torn p129 by irvaton of » inker at the Bgl B 
site VMthh AXL r , Mhich ted to an irvtama insertion of 
the hemagglutinin (HA) eiAope (DOrPtXTVPOYAJ (29) 
between ambx) acids 854 tfidSSSol the A>1 7 prod- 



Quantitative Monitoring of Gene Expression
Patterns with a Complementary DNA Microarray

Mark Schena,* Dari Shalon,*† Ronald W. Davis,
Patrick O. Brown‡

A high-capacity system was developed to monitor the expression of many genes in
parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on
glass were used for quantitative expression measurements of the corresponding genes.
Because of the small format and high density of the arrays, hybridization volumes of 2
microliters could be used that enabled detection of rare transcripts in probe mixtures
derived from 2 micrograms of total cellular messenger RNA. Differential expression
measurements of 45 Arabidopsis genes were made by means of simultaneous two-color
fluorescence hybridization.



The temporal, developmental, topographical, histological, and physiological patterns
in which a gene is expressed provide clues to its biological role. The large and
expanding database of complementary DNA (cDNA) sequences from many organisms (1)
presents the opportunity of defining these patterns at the level of the whole genome.

For these studies, we used the small flowering plant Arabidopsis thaliana as a model
organism. Arabidopsis possesses many advantages for gene expression analysis,
including the fact that it has the smallest genome of any higher eukaryote examined
to date (2). Forty-five cloned Arabidopsis cDNAs (Table 1), including 14 complete
sequences and 31 expressed sequence tags (ESTs), were used as gene-specific targets.
We obtained the ESTs by selecting cDNA clones at random from an Arabidopsis
cDNA library. Sequence analysis revealed that 28 of the 31 ESTs matched sequences

M. Schena and R. W. Davis, Department of Biochemistry,
Beckman Center, Stanford University Medical Center,
Stanford, CA 94305, USA.

D. Shalon and P. O. Brown, Department of Biochemistry
and Howard Hughes Medical Institute, Beckman Center,
Stanford University Medical Center, Stanford, CA 94305,
USA.

*These authors contributed equally to this work.
†Present address: Synteni, Palo Alto, CA 94303, USA.
‡To whom correspondence should be addressed. E-mail:
pbrown@cmgm.stanford.edu



in the database (Table 1). Three additional cDNAs from other organisms served as
controls in the experiments.

The 48 cDNAs, averaging ~1.0 kb, were amplified with the polymerase chain
reaction (PCR) and deposited into individual wells of a 96-well microtiter plate.
Each sample was duplicated in two adjacent wells to allow the reproducibility of
the arraying and hybridization process to be tested. Samples from the microtiter
plate were printed onto glass microscope slides in an area measuring 3.5 mm by 5.5
mm with the use of a high-speed arraying machine (3). The arrays were processed by
chemical and heat treatment to attach the DNA sequences to the glass surface and
denature them (3). Three arrays, printed in a single lot, were used for the
experiments here. A single microtiter plate of PCR products provides sufficient
material to print at least 500 arrays.
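
The array geometry described above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the 96 printed features sit on a rectangular grid of 8 rows (a to h) by 12 columns (1 to 12), mirroring the 96-well plate and the position labels of Table 1, at the 500-µm spacing given in note 3. The function names are hypothetical and the grid orientation is an assumption, not something the report states.

```python
import string


def feature_layout(n_rows=8, n_cols=12, pitch_mm=0.5):
    """Map position labels (a1 ... h12) to (row, column) offsets in mm.

    Assumes a regular rectangular grid; pitch_mm is the 500-um
    centre-to-centre spacing quoted in note 3.
    """
    coords = {}
    for r in range(n_rows):
        for c in range(n_cols):
            label = f"{string.ascii_lowercase[r]}{c + 1}"
            coords[label] = (r * pitch_mm, c * pitch_mm)
    return coords


def grid_extent_mm(n_rows=8, n_cols=12, pitch_mm=0.5):
    """Centre-to-centre extent of the grid along each axis, in mm."""
    return ((n_rows - 1) * pitch_mm, (n_cols - 1) * pitch_mm)


layout = feature_layout()
extent = grid_extent_mm()
print(len(layout), extent)  # 96 features, extent (3.5, 5.5) mm
```

Under these assumptions the extent works out to 3.5 mm by 5.5 mm, which agrees with the printed area quoted in the text; duplicate wells such as a1 and a2 land on adjacent features.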

Fluorescent probes were prepared from total Arabidopsis mRNA (4) by a single
round of reverse transcription (5). The Arabidopsis mRNA was supplemented with
human acetylcholine receptor (AChR) mRNA at a dilution of 1:10,000 (w/w) before cDNA
synthesis, to provide an internal standard for calibration (5). The resulting
fluorescently labeled cDNA mixture was hybridized to an array at high stringency (6)
and scanned



SCIENCE • VOL 270 • 20 OCTOBER 1995



467 




with a laser (3). A high-sensitivity scan gave signals that saturated the detector
at nearly all of the Arabidopsis target sites (Fig. 1A). Calibration relative to the
AChR mRNA standard (Fig. 1A) established a sensitivity limit of ~1:50,000. No
detectable hybridization was observed to either the rat glucocorticoid receptor
(Fig. 1A) or the yeast TRP4 (Fig. 1A) targets even at the highest scanning
sensitivity. A moderate-sensitivity scan






of the same array allowed linear detection of the more abundant transcripts (Fig. 1B).
Quantitation of both scans revealed a range of expression levels spanning three orders
of magnitude for the 45 genes tested (Table 2). RNA blots (7) for several genes (Fig. 2)
corroborated the expression levels measured with the microarray to within a factor of 5
(Table 2).

[Figure 1. Panels: (A) High sensitivity; (B) Moderate sensitivity; (C) Wild type;
(D) HAT4 transgenic; (E) Root tissue; (F) Leaf tissue. Array rows a to h, columns 1 to 12.
Color bars: expression level (w/w).]

Fig. 1. Gene expression monitored with the use of cDNA microarrays. Fluorescent scans
represented in pseudocolor correspond to hybrid[...] with the use of known
concentrations of human AChR [...] letters on the axes mark the position of each cDNA.
(A) High-sensitivity [...] with fluorescein-labeled cDNA derived from wild-type plants.
(B) Same array as in (A) but scanned at moderate sensitivity. (C and D) A single array
was probed with a 1:1 mixture of fluorescein-labeled cDNA from wild-type plants and
lissamine-labeled cDNA from HAT4-transgenic plants. The single array [...] then scanned
successively to detect the fluorescein fluorescence corresponding to mRNA [...] plants (C)
and the lissamine fluorescence corresponding to mRNA from HAT4-transgenic plants (D).
(E and F) A single array was probed with a 1:1 mixture of fluorescein-labeled cDNA from
root tissue and lissamine-labeled cDNA from leaf tissue. The single array was then scanned
successively to detect the fluorescein fluorescence corresponding to mRNAs expressed in
roots (E) and the lissamine fluorescence corresponding to mRNAs expressed in leaves (F).

Differential gene expression was investigated with a simultaneous, two-color
hybridization scheme, which served to minimize experimental variation inherent in the
comparison of independent hybridizations. Fluorescent probes were prepared from two
mRNA sources with the use of reverse transcriptase in the presence of fluorescein- and
lissamine-labeled nucleotide analogs, respectively (5). The two probes were then
mixed together in equal proportions, hybridized to a single array, and scanned
separately for fluorescein and lissamine emission after independent excitation of the two
fluorophores (3).
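
The calibration idea behind the two-color scheme can be sketched in a few lines. This is a hedged illustration, not the authors' analysis software: it assumes that background-subtracted spot signal is proportional to transcript abundance within each channel, so that the spiked AChR standard, added at a known dilution, converts signal into a w/w expression level. The signal values and the gene name `geneX` are hypothetical; the two dilutions (1:10,000 and 1:100) are the ones quoted for the fluorescein and lissamine reactions in the HAT4 experiment.

```python
def expression_level(signal, standard_signal, standard_dilution):
    """Estimate abundance (w/w fraction of mRNA) from a spot signal.

    Assumes a linear detector response, so that
    abundance = standard_dilution * signal / standard_signal.
    """
    if standard_signal <= 0:
        raise ValueError("standard signal must be positive")
    return standard_dilution * signal / standard_signal


def fold_difference(level_a, level_b):
    """Symmetric fold difference (always >= 1) between two levels."""
    high, low = max(level_a, level_b), min(level_a, level_b)
    return high / low


# Hypothetical background-subtracted signals, arbitrary units.
fluorescein = {"AChR": 200.0, "geneX": 1000.0}  # standard spiked at 1:10,000
lissamine = {"AChR": 400.0, "geneX": 100.0}     # standard spiked at 1:100

wild_type = expression_level(fluorescein["geneX"], fluorescein["AChR"], 1 / 10_000)
transgenic = expression_level(lissamine["geneX"], lissamine["AChR"], 1 / 100)
change = fold_difference(wild_type, transgenic)
print(wild_type, transgenic, change)
```

Calibrating each channel against its own standard is what lets the two scans be compared even though the fluorophores are detected with different sensitivities; raw signals from the two channels are never compared directly.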

To test whether overexpression of a single gene could be detected in a pool of total
Arabidopsis mRNA, we used a microarray to analyze a transgenic line overexpressing the
single transcription factor HAT4 (8). Fluorescent probes representing mRNA from
wild-type and HAT4-transgenic plants were labeled with fluorescein and lissamine,
respectively; the two probes were then mixed and hybridized to a single array. An intense
hybridization signal was observed at the position of the HAT4 cDNA in the
lissamine-specific scan (Fig. 1D), but not in the fluorescein-specific scan of the same
array (Fig. 1C). Calibration with AChR mRNA added to the fluorescein and lissamine
cDNA synthesis reactions at dilutions of 1:10,000 (Fig. 1C) and 1:100 (Fig. 1D),
respectively, revealed a 50-fold elevation of HAT4 mRNA in the transgenic line relative
to its abundance in wild-type plants (Table 2). This magnitude of HAT4 overexpression
matched that inferred from the Northern (RNA) analysis within a factor of
2 (Fig. 2 and Table 2). Expression of all the other genes monitored on the array differed
by less than a factor of 5 between HAT4-transgenic and wild-type plants (Fig. 1, C



[Figure 2. RNA blot panels: wild type; probes CAB1, HAT4, and ROC1; human AChR
calibration series. Axis: mRNA (ng).]

Fig. 2. Gene expression monitored with RNA (Northern) blot analysis. Designated amounts of
mRNA from wild-type and HAT4-transgenic plants were spotted onto nylon membranes and
probed with the cDNAs indicated. Purified human AChR mRNA was used for calibration.






and D, and Table 2). Hybridization of fluorescein-labeled glucocorticoid receptor
cDNA (Fig. 1C) and lissamine-labeled TRP4 cDNA (Fig. 1D) verified the presence of the
negative control targets and the lack of optical cross talk between the two
fluorophores.

To explore a more complex alteration in expression patterns, we performed a second
two-color hybridization experiment with fluorescein- and lissamine-labeled probes
prepared from root and leaf mRNA, respectively. The scanning sensitivities for the
two fluorophores were normalized by matching the signals resulting from AChR
mRNA, which was added to both cDNA synthesis reactions at a dilution of 1:1000
(Fig. 1, E and F). A comparison of the scans revealed widespread differences in gene
expression between root and leaf tissue (Fig. 1, E and F). The mRNA from the
light-regulated CAB1 gene was ~500-fold more abundant in leaf (Fig. 1F) than in root
tissue (Fig. 1E). The expression of 26 other genes differed between root and leaf tissue
by more than a factor of 5 (Fig. 1, E and F).

The HAT4-transgenic line we examined has elongated hypocotyls, early flowering,
poor germination, and altered pigmentation (8). Although changes in expression were



Table 1. Sequences contained on the cDNA microarray. Shown is the position, the known or
putative function, and the accession number of each cDNA in the [...] adenine
dinucleotide; ATPase, adenosine triphosphatase; GTP, guanosine triphosphate.

Position   cDNA     Function                          Accession number
a1, 2      AChR     Human AChR                        *
a3, 4      EST3     Actin                             H36236
a5, 6      EST6     NADH dehydrogenase                Z27010
a7, 8      AAC1     Actin 1                           M20016
a9, 10     EST12    Unknown                           U36594†
a11, 12    EST13    Actin                             T45783
b1, 2      CAB1     Chlorophyll a/b binding           M85150
b3, 4      EST17    Phosphoglycerate kinase           T44490
b5, 6      GA4      Gibberellic acid biosynthesis     L37126
b7, 8      EST19    Unknown                           U36595†
b9, 10     GBF-1    G-box binding factor 1            X63894
b11, 12    EST23    Elongation factor                 X52256
c1, 2      EST29    Aldolase                          T04477
c3, 4      GBF-2    G-box binding factor 2            X63895
c5, 6      EST34    Chloroplast protease              R87034
c7, 8      EST35    Unknown                           T14152
c9, 10     EST41    Catalase                          T22720
c11, 12    rGR      Rat glucocorticoid receptor       M14053
d1, 2      EST42    Unknown                           U36596†
d3, 4      EST45    ATPase                            J04185
d5, 6      HAT1     Homeobox-leucine zipper 1         U09332
d7, 8      EST46    Light harvesting complex          T04063
d9, 10     EST49    Unknown                           T76267
d11, 12    HAT2     Homeobox-leucine zipper 2         U09335
e1, 2      HAT4     Homeobox-leucine zipper 4         M90394
e3, 4      EST50    Phosphoribulokinase               T04344
e5, 6      HAT5     Homeobox-leucine zipper 5         M90416
e7, 8      EST51    Unknown                           Z33675
e9, 10     HAT22    Homeobox-leucine zipper 22        U09336
e11, 12    EST52    Oxygen evolving                   T21749
f1, 2      EST59    Unknown                           Z34607
f3, 4      KNAT1    Knotted-like homeobox 1           U14174
f5, 6      [...]    RuBisCO small subunit             X14564
f7, 8      EST69    Translation elongation factor     T42799
f9, 10     PPH1     Protein phosphatase 1             U34803
f11, 12    EST70    Unknown                           T44621
g1, 2      EST75    Chloroplast protease              T43698
g3, 4      EST78    Unknown                           R65481
g5, 6      ROC1     Cyclophilin                       L14844
g7, 8      EST82    GTP binding                       X59152
g9, 10     EST83    Unknown                           Z33795
g11, 12    EST84    Unknown                           T45276
h1, 2      EST91    Unknown                           T13832
h3, 4      EST96    Unknown                           R64816
h5, 6      SAR1     Synaptobrevin                     M90418
h7, 8      EST100   Light harvesting complex          Z18205
h9, 10     EST103   Light harvesting complex          X03909
h11, 12    TRP4     Yeast tryptophan biosynthesis     X04273

*Proprietary sequence of Stratagene (La Jolla, California). †No match in the database; novel EST.



observed for HAT4, large changes in expression were not observed for any of the
other 44 genes we examined. This was somewhat surprising, particularly because
comparative analysis of leaf and root tissue identified 27 differentially expressed genes.
Analysis of an expanded set of genes may be required to identify genes whose expression
changes upon HAT4 overexpression; alternatively, a comparison of mRNA populations
from specific tissues of wild-type and HAT4-transgenic plants may allow
identification of downstream genes.

At the current density of robotic printing, it is feasible to scale up the fabrication
process to produce arrays containing 20,000 cDNA targets. At this density, a single array
would be sufficient to provide gene-specific targets encompassing nearly the entire
repertoire of expressed genes in the Arabidopsis genome (2). The availability of 20,274
ESTs from Arabidopsis (1, 9) would provide a rich source of templates for such studies.

The estimated 100,000 genes in the human genome (10) exceeds the number of
Arabidopsis genes by a factor of 5 (2). This modest increase in complexity suggests that
similar cDNA microarrays, prepared from the rapidly growing repertoire of human
ESTs (1), could be used to determine the expression patterns of tens of thousands of
human genes in diverse cell types. Coupling an amplification strategy to the reverse
transcription reaction (11) could make it feasible to monitor expression even in
minute tissue samples. A wide variety of acute and chronic physiological and
pathological conditions might lead to characteristic changes in the patterns of gene
expression in peripheral blood cells or other easily sampled tissues. In concert with
cDNA microarrays for monitoring complex expression patterns, these tissues might
therefore serve as sensitive in vivo sensors for clinical diagnosis. Microarrays of cDNAs
could thus provide a useful link between human gene sequences and clinical medicine.



Table 2. Gene expression monitoring by microarray and RNA blot analyses; tg, HAT4-transgenic.
See Table 1 for additional gene information. Expression levels (w/w) were calibrated with the
use of known amounts of human AChR mRNA. Values for the microarray were determined from
microarray scans (Fig. 1); values for the RNA blot were determined from RNA blots (Fig. 2).

               Expression level (w/w)
Gene           Microarray     RNA blot
CAB1           1:48           1:83
CAB1 (tg)      1:120          1:150
HAT4           1:8300         1:6300
HAT4 (tg)      1:150          1:210
ROC1           1:1200         1:1800
ROC1 (tg)      1:260          1:1300
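
The fold changes discussed in the text can be recomputed from the Table 2 entries for HAT4. The sketch below is a worked check rather than part of the report; it assumes the flattened table values pair up row by row as (microarray, RNA blot), which is the reading that reproduces the roughly 50-fold elevation quoted in the text. The dictionary and function names are illustrative.

```python
from fractions import Fraction

# Expression levels (w/w) for HAT4 as (microarray, RNA blot),
# read from Table 2 under the row pairing assumed above.
TABLE_2_HAT4 = {
    "wild type": (Fraction(1, 8300), Fraction(1, 6300)),
    "transgenic": (Fraction(1, 150), Fraction(1, 210)),
}


def fold_elevation(transgenic_level, wild_type_level):
    """Ratio of transgenic to wild-type abundance."""
    return float(transgenic_level / wild_type_level)


array_fold = fold_elevation(TABLE_2_HAT4["transgenic"][0], TABLE_2_HAT4["wild type"][0])
blot_fold = fold_elevation(TABLE_2_HAT4["transgenic"][1], TABLE_2_HAT4["wild type"][1])
agreement = array_fold / blot_fold
print(round(array_fold, 1), blot_fold, round(agreement, 2))  # 55.3 30.0 1.84
```

The microarray estimate (about 55-fold) and the RNA blot estimate (30-fold) differ by a factor of about 1.8, consistent with the statement that the two methods agree within a factor of 2.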







REFERENCES AND NOTES 



1. The current EST database (dbEST release 091495)
from the National Center for Biotechnology Information
(Bethesda, MD) contains a total of 322,225 entries,
including 255,645 from the human genome
and 21,044 from Arabidopsis. Access is available via
the World Wide Web (http://www.ncbi.nlm.nih.gov).

2. E. M. Meyerowitz and R. E. Pruitt, Science 229, 1214
(1985); R. E. Pruitt and E. M. Meyerowitz, J. Mol. Biol.
187, 169 (1986); I. Hwang et al., Plant J. 1, 367 (1991);
P. Jarvis et al., Plant Mol. Biol. 24, 686 (1994); L. Le
Guen et al., Mol. Gen. Genet. 245, 390 (1994).

3. D. Shalon, thesis, Stanford University (1995);
___ and P. O. Brown, in preparation. Microarrays were
fabricated on poly-L-lysine-coated microscope
slides (Sigma) with a custom-built arraying machine
fitted with one printing tip. The tip loaded 1 µl of PCR
product (0.5 mg/ml) from 96-well microtiter plates
and deposited ~0.006 µl per slide on 40 slides at a
spacing of 500 µm. The printed slides were rehydrated
for 2 hours in a humid chamber, snap-dried at
100°C for 1 min, rinsed in 0.1% SDS, and treated
with 0.05% succinic anhydride prepared in buffer
consisting of 50% 1-methyl-2-pyrrolidinone and
50% boric acid. The cDNA on the slides was denatured
in distilled water for 2 min at 90°C immediately
before use. Microarrays were scanned with a laser
fluorescent scanner that contained a computer-controlled
XY stage and a microscope objective. A mixed
gas, multiline laser allowed sequential excitation of
the two fluorophores. Emitted light was split according
to wavelength and detected with two photomultiplier
tubes. Signals were read into a PC with the use
of a 13-bit analog-to-digital board. Additional details
of microarray fabrication and use may be obtained by
means of e-mail (pbrown@cmgm.stanford.edu).

4. F. M. Ausubel et al., Eds., Current Protocols in Molecular
Biology (Greene & Wiley Interscience, New
York, 1994), pp. 4.3.1-4.3.4.

5. Polyadenylated [poly(A)+] mRNA was prepared from
total RNA with the use of Oligotex-dT resin (Qiagen).
Reverse transcription (RT) reactions were carried out
with a StrataScript RT-PCR kit (Stratagene) modified
as follows: 50-µl reactions contained 0.1 µg/µl of
Arabidopsis mRNA, 0.1 ng/µl of human AChR
mRNA, 0.06 µg/µl of oligo(dT) (21-mer), 1× first
strand buffer, 0.03 U/µl of ribonuclease block, 500
µM deoxyadenosine triphosphate (dATP), 500 µM
deoxyguanosine triphosphate, 500 µM dTTP, 40 µM
deoxycytosine triphosphate (dCTP), 40 µM
fluorescein-12-dCTP (or lissamine-5-dCTP), and 0.03
U/µl of StrataScript reverse transcriptase. Reactions
were incubated for 60 min at 37°C, precipitated with
ethanol, and resuspended in 10 µl of TE (10 mM tris-HCl
and 1 mM EDTA, pH 8.0). Samples were then
heated for 3 min at 94°C and chilled on ice. The RNA
was degraded by adding 0.25 µl of 10 N NaOH
followed by a 10-min incubation at 37°C. The samples
were neutralized by addition of 2.5 µl of 1 M
tris-Cl (pH 8.0) and 0.25 µl of 10 N HCl and precipitated
with ethanol. Pellets were washed with 70%
ethanol, dried to completion in a speedvac, resuspended
in 10 µl of H2O, and reduced to 3.0 µl in a
speedvac. Fluorescent nucleotide analogs were obtained
from New England Nuclear (DuPont).
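
As a rough consistency check on note 5, the masses in one labeling reaction follow directly from the stated concentrations. The sketch below assumes the readings 0.1 µg/µl Arabidopsis mRNA and 0.1 ng/µl AChR mRNA in a 50-µl reaction (these figures are partly garbled in the scan, so treat them as a best reading), and assumes that the 3.0-µl final product is used at 1.0 µl per hybridization as described in note 6. The variable names are illustrative.

```python
REACTION_VOLUME_UL = 50.0
ARABIDOPSIS_MRNA_UG_PER_UL = 0.1  # assumed reading of note 5
ACHR_MRNA_NG_PER_UL = 0.1         # assumed reading of note 5
FINAL_PRODUCT_UL = 3.0            # volume after the last speedvac step
PROBE_PER_HYB_UL = 1.0            # volume used per hybridization (note 6)


def total_mass(conc_per_ul, volume_ul):
    """Mass of a component in the reaction (same mass unit as conc)."""
    return conc_per_ul * volume_ul


sample_ug = total_mass(ARABIDOPSIS_MRNA_UG_PER_UL, REACTION_VOLUME_UL)
standard_ng = total_mass(ACHR_MRNA_NG_PER_UL, REACTION_VOLUME_UL)

# Weight ratio of sample mRNA to spiked standard (both in ng).
dilution = (sample_ug * 1000.0) / standard_ng

# mRNA equivalent represented in one hybridization, ignoring losses.
mrna_per_hyb_ug = sample_ug * PROBE_PER_HYB_UL / FINAL_PRODUCT_UL

print(sample_ug, standard_ng, dilution, round(mrna_per_hyb_ug, 2))
```

With these readings each reaction holds about 5 µg of mRNA and 5 ng of standard, a 1:1000 (w/w) spike, and one hybridization represents roughly 1.7 µg of input mRNA, in line with the approximately 2 µg quoted in the abstract. Other experiments in the report used 1:10,000 and 1:100 spikes, so the standard concentration evidently varied between reactions.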

6. Hybridization reactions contained 1.0 µl of fluorescent
cDNA synthesis product (5) and 1.0 µl of hybridization
buffer [10× saline sodium citrate (SSC) and 0.2%
SDS]. The 2.0-µl probe mixtures were aliquoted onto
the microarray surface and covered with cover slips
(12 mm round). Arrays were transferred to a hybridization
chamber (3) and incubated for 18 hours at
65°C. Arrays were washed for 5 min at room temperature
(25°C) in low-stringency wash buffer (1× SSC
and 0.1% SDS), then for 10 min at room temperature
in high-stringency wash buffer (0.1× SSC and 0.1%
SDS). Arrays were scanned in 0.1× SSC with the use
of a fluorescence laser-scanning device (3).
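
The working concentrations during hybridization follow from the 1:1 mix described in note 6. This is a simple dilution sketch; it assumes the labeled cDNA product contributes no salt or detergent, which seems reasonable because note 5 has it resuspended in water. The function name is illustrative.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Concentration after diluting a stock into a larger volume."""
    return stock_conc * stock_volume_ul / total_volume_ul


probe_ul = 1.0    # fluorescent cDNA synthesis product
buffer_ul = 1.0   # hybridization buffer: 10x SSC, 0.2% SDS
total_ul = probe_ul + buffer_ul

ssc_x = final_concentration(10.0, buffer_ul, total_ul)   # in multiples of SSC
sds_pct = final_concentration(0.2, buffer_ul, total_ul)  # in percent

print(total_ul, ssc_x, sds_pct)
```

Under that assumption the 2-µl hybridization runs at 5× SSC and 0.1% SDS.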

7. Samples of poly(A)+ mRNA (4, 5) were spotted onto
nylon membranes (Nytran) and crosslinked with
ultraviolet light with the use of a Stratalinker 1800
(Stratagene). Probes were prepared by random
priming with the use of a Prime-It II kit (Stratagene) in
the presence of [32P]dATP. Hybridizations were carried
out according to the instructions of the manufacturer.
Quantitation was performed on a PhosphorImager
(Molecular Dynamics).

8. M. Schena and R. W. Davis, Proc. Natl. Acad. Sci.
U.S.A. 89, 3894 (1992); M. Schena, A. M. Lloyd, R.
W. Davis, Genes Dev. 7, 367 (1993); M. Schena and
R. W. Davis, Proc. Natl. Acad. Sci. U.S.A. 91, 8393
(1994).

9. H. Höfte et al., Plant J. 4, 1051 (1993); T. Newman et
al., Plant Physiol. 106, 1241 (1994).

10. N. E. Morton, Proc. Natl. Acad. Sci. U.S.A. 88, 7474
(1991); E. D. Green and R. H. Waterston, J. Am.
Med. Assoc. 266, 1966 (1991); C. Bellanné-Chantelot,
Cell 70, 1059 (1992); D. R. Cox et al., Science
265, 2031 (1994).

11. E. S. Kawasaki et al., Proc. Natl. Acad. Sci. U.S.A.
85, 5698 (1988).



12. The laser fluorescent scanner was designed and fabricated
in collaboration with S. Smith of Stanford University.
Scanner and analysis software was developed by
R. X. Xia. The succinic anhydride reaction was suggested
by J. Mulligan and J. Van Ness of Darwin Molecular
Corporation. Thanks to S. Theologis, C. Somerville, K.
Yamamoto, and members of the laboratories of R.W.D.
and P.O.B. for critical comments. Supported by the
Howard Hughes Medical Institute and by grants from
NIH [R21HG00450 (P.O.B.) and R37AG00196
(R.W.D.)] and from NSF (MCB9106011) (R.W.D.) and
by an NSF graduate fellowship (D.S.). P.O.B. is an
assistant investigator of the Howard Hughes Medical
Institute.

11 August 1995; accepted 22 September 1995



Gene Therapy in Peripheral Blood
Lymphocytes and Bone Marrow for
ADA⁻ Immunodeficient Patients

Claudio Bordignon,* Luigi D. Notarangelo, Nadia Nobili,
Giuliana Ferrari, Giulia Casorati, Paola Panina, Evelina Mazzolari,
Daniela Maggioni, Claudia Rossi, Paolo Servida,
Alberto G. Ugazio, Fulvio Mavilio

Adenosine deaminase (ADA) deficiency results in severe combined immunodeficiency,
the first genetic disorder treated by gene therapy. Two different retroviral vectors were
used to transfer ex vivo the human ADA minigene into bone marrow cells and peripheral
blood lymphocytes from two patients undergoing exogenous enzyme replacement therapy.
After 2 years of treatment, long-term survival of T and B lymphocytes, marrow cells,
and granulocytes expressing the transferred ADA gene was demonstrated and resulted
in normalization of the immune repertoire and restoration of cellular and humoral immunity.
After discontinuation of treatment, T lymphocytes, derived from transduced peripheral
blood lymphocytes, were progressively replaced by marrow-derived T cells in both patients.
These results indicate successful gene transfer into long-lasting progenitor cells,
producing a functional multilineage progeny.



Severe combined immunodeficiency associated with inherited deficiency of ADA (1)
is usually fatal unless affected children are kept in protective isolation or the
immune system is reconstituted by bone marrow transplantation from a human
leukocyte antigen (HLA)-identical sibling donor (2). This is the therapy of choice,
although it is available only for a minority of patients. In recent years, other forms
of therapy have been developed, including transplants from haploidentical donors
(3, 4), exogenous enzyme replacement (5), and somatic-cell gene therapy (6-9).

We previously reported a preclinical model in which ADA gene transfer and expression

C. Bordignon, N. Nobili, G. Ferrari, D. Maggioni, C. Rossi, P. Servida, F. Mavilio,
Telethon Gene Therapy Program for Genetic Diseases, DIBIT, Istituto Scientifico
H. S. Raffaele, Milan, Italy.

L. D. Notarangelo, E. Mazzolari, A. G. Ugazio, Department of Pediatrics, University
of Brescia Medical School, Brescia, Italy.

G. Casorati, Unità di Immunochimica, DIBIT, Istituto Scientifico H. S. Raffaele,
Milan, Italy.

P. Panina, Roche Milano Ricerche, Milan, Italy.

*To whom correspondence should be addressed.



successfully restored immune functions in human ADA-deficient (ADA⁻) peripheral
blood lymphocytes (PBLs) in immunodeficient mice in vivo (10, 11). On the basis of
these preclinical results, the clinical application of gene therapy for the treatment of
ADA⁻ SCID (severe combined immunodeficiency disease) patients who previously failed
exogenous enzyme replacement therapy was approved by our Institutional Ethical
Committees and by the Italian National Committee for Bioethics (12). In addition to
evaluating the safety and efficacy of the gene therapy procedure, the aim of the study
was to define the relative role of PBLs and hematopoietic stem cells in the long-term
reconstitution of immune functions after retroviral vector-mediated ADA gene transfer.
For this purpose, two structurally identical vectors expressing the human ADA
complementary DNA (cDNA), distinguishable by the presence of alternative restriction
sites in a nonfunctional region of the viral long-terminal repeat (LTR), were used to
transduce PBLs and bone marrow (BM) cells independently. This procedure allowed
identification of the origin of






PCT



WORLD INTELLECTUAL PROPERTY ORGANIZATION 

International Bureau 




Reference 4 of 20 
with Response dated 05/04/04 
In USSN: 09/857,826 



INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification ^ : 
GOIN 33/543, 33/68 



Al 



(11) International Publication Number: WO 95/35505 

(43) Inteinattonal Publication Date: 28 December 1995 (28.12.95) 



(21) International Application Number: 

(22) International Filing Date: 



PCT/US95/07659 



16 June 1995 (16.06.95) 



(30) Priority Data: 
08/261,388 
08/477.809 



17 June 1994(17.06.94) 
7 June 1995 (07.06.95) 



US 
US 



(71) Applicant: THE BOARD OF TRUSTEES OF THE LELAND
STANFORD JUNIOR UNIVERSITY [US/US]; Stanford,
CA 94305 (US).

(72) Inventors: SHALON, Tidhar, Dari; 364 Fletcher Drive,
Atherton, CA 94027 (US). BROWN, Patrick, O.; 76 Peter
Coutts Circle, Stanford, CA 94305 (US).

(74) Agent: DEHLINGER, Peter, J.; Dehlinger & Associates, P.O.
Box 60850, Palo Alto, CA 94306-1546 (US).



(81) Designated States: AU, CA, JP, European patent (AT, BE,
CH, DE, DK, ES, FR, GB, GR, IE, IT, LU, MC, NL, PT,
SE).



Published 

With international search report. 



(54) Title: METHOD AND APPARATUS FOR FABRICATING MICROARRAYS OF BIOLOGICAL SAMPLES 
(57) Abstract 

A method and apparatus for forming microarrays of biological samples on a support are disclosed. The method involves dispensing
a known volume of a reagent at each selected array position, by tapping a capillary dispenser on the support under conditions effective
to draw a defined volume of liquid onto the support. The apparatus is designed to produce a microarray of such regions in an automated
fashion.



FOR THE PURPOSES OF INFORMATION ONLY 



Codes used to identify States party to the PCT on the front pages of pamphlets publishing international 
applications under the PCT. 



AT  Austria                    GB  United Kingdom                          MR  Mauritania
AU  Australia                  GE  Georgia                                 MW  Malawi
BB  Barbados                   GN  Guinea                                  NE  Niger
BE  Belgium                    GR  Greece                                  NL  Netherlands
BF  Burkina Faso               HU  Hungary                                 NO  Norway
BG  Bulgaria                   IE  Ireland                                 NZ  New Zealand
BJ  Benin                      IT  Italy                                   PL  Poland
BR  Brazil                     JP  Japan                                   PT  Portugal
BY  Belarus                    KE  Kenya                                   RO  Romania
CA  Canada                     KG  Kyrgystan                               RU  Russian Federation
CF  Central African Republic   KP  Democratic People's Republic of Korea   SD  Sudan
CG  Congo                      KR  Republic of Korea                       SE  Sweden
CH  Switzerland                KZ  Kazakhstan                              SI  Slovenia
CI  Côte d'Ivoire              LI  Liechtenstein                           SK  Slovakia
CM  Cameroon                   LK  Sri Lanka                               SN  Senegal
CN  China                      LU  Luxembourg                              TD  Chad
CS  Czechoslovakia             LV  Latvia                                  TG  Togo
CZ  Czech Republic             MC  Monaco                                  TJ  Tajikistan
DE  Germany                    MD  Republic of Moldova                     TT  Trinidad and Tobago
DK  Denmark                    MG  Madagascar                              UA  Ukraine
ES  Spain                      ML  Mali                                    US  United States of America
FI  Finland                    MN  Mongolia                                UZ  Uzbekistan
FR  France                                                                 VN  Viet Nam
GA  Gabon











WO 95/35505                                PCT/US95/07659

METHOD AND APPARATUS FOR FABRICATING
MICROARRAYS OF BIOLOGICAL SAMPLES

Field of the Invention 

This invention relates to a method and apparatus 
for fabricating microarrays of biological samples for 
large scale screening assays, such as arrays of DNA 
samples to be used in DNA hybridization assays for 
genetic research and diagnostic applications.



Abouzied, et al., Journal of AOAC International
77(2):495-500 (1994).

Bohlander, et al., Genomics 13:1322-1324 (1992).

Drmanac, et al., Science 260:1649-1652 (1993).

Fodor, et al., Science 251:767-773 (1991).

Khrapko, et al., DNA Sequence 1:375-388 (1991).

Kuriyama, et al., An ISFET Biosensor, Applied Biosensors
(Donald Wise, Ed.), Butterworths, pp. 93-114 (1989).

Lehrach, et al., Hybridization Fingerprinting in Genome
Mapping and Sequencing, Genome Analysis, Vol. 1 (Davies and
Tilghman, Eds.), Cold Spring Harbor Press, pp. 39-81
(1990).

Maniatis, et al., Molecular Cloning, A Laboratory
Manual, Cold Spring Harbor Press (1989).

Nelson, et al., Nature Genetics 4:11-18 (1993).




Pirrung, et al., U.S. Patent No. 5,143,854 (1992).

Riles, et al., Genetics 134:81-150 (1993).

Schena, M. et al., Proc. Nat. Acad. Sci. USA
89:3894-3898 (1992).

Southern, et al., Genomics 13:1008-1017 (1992).

Background of the Invention 

A variety of methods are currently available for
making arrays of biological macromolecules, such as
arrays of nucleic acid molecules or proteins. One
method for making ordered arrays of DNA on a porous
membrane is a "dot blot" approach. In this method, a
vacuum manifold transfers a plurality, e.g., 96,
aqueous samples of DNA from 3 millimeter diameter wells
to a porous membrane. A common variant of this
procedure is a "slot-blot" method in which the wells
have highly-elongated oval shapes.

The DNA is immobilized on the porous membrane by
baking the membrane or exposing it to UV radiation.
This is a manual procedure practical for making one
array at a time and usually limited to 96 samples per
array. "Dot-blot" procedures are therefore inadequate
for applications in which many thousand samples must be
determined.

A more efficient technique employed for making
ordered arrays of genomic fragments uses an array of
pins dipped into the wells, e.g., the 96 wells of a
microtitre plate, for transferring an array of samples
to a substrate, such as a porous membrane. One array
includes pins that are designed to spot a membrane in a
staggered fashion, for creating an array of 9216 spots
in a 22 x 22 cm area (Lehrach, et al., 1990). A
limitation with this approach is that the volume of DNA
spotted in each pixel of each array is highly variable.




In addition, the number of arrays that can be made with 
each dipping is usually quite small. 

An alternate method of creating ordered arrays of
nucleic acid sequences is described by Pirrung, et al.
(1992), and also by Fodor, et al. (1991). The method
involves synthesizing different nucleic acid sequences
at different discrete regions of a support. This
method employs elaborate synthetic schemes, and is
generally limited to relatively short nucleic acid
samples, e.g., less than 20 bases. A related method has
been described by Southern, et al. (1992).

Khrapko, et al. (1991) describe a method of
making an oligonucleotide matrix by spotting DNA onto a
thin layer of polyacrylamide. The spotting is done
manually with a micropipette.

None of the methods or devices described in the
prior art are designed for mass fabrication of
microarrays characterized by (i) a large number of
micro-sized assay regions separated by a distance of
50-200 microns or less, and (ii) a well-defined amount,
typically in the picomole range, of analyte associated
with each region of the array.

Furthermore, current technology is directed at
performing such assays one at a time to a single array
of DNA molecules. For example, the most common method
for performing DNA hybridizations to arrays spotted
onto porous membrane involves sealing the membrane in a
plastic bag (Maniatis, et al., 1989) or a rotating
glass cylinder (Robbins Scientific) with the labeled
hybridization probe inside the sealed chamber. For
arrays made on non-porous surfaces, such as a
microscope slide, each array is incubated with the
labeled hybridization probe sealed under a coverslip.
These techniques require a separate sealed chamber for




each array which makes the screening and handling of 
many such arrays inconvenient and time intensive. 

Abouzied, et al. (1994) describe a method of
printing horizontal lines of antibodies on a
nitrocellulose membrane and separating regions of the
membrane with vertical stripes of a hydrophobic
material. Each vertical stripe is then reacted with a
different antigen and the reaction between the
immobilized antibody and an antigen is detected using a
standard ELISA colorimetric technique. Abouzied's
technique makes it possible to screen many one-
dimensional arrays simultaneously on a single sheet of
nitrocellulose. Abouzied makes the nitrocellulose
somewhat hydrophobic using a line drawn with a PAP Pen
(Research Products International). However, Abouzied
does not describe a technology that is capable of
completely sealing the pores of the nitrocellulose. The
pores of the nitrocellulose are still physically open,
and so the assay reagents can leak through the
hydrophobic barrier during extended high temperature
incubations or in the presence of detergents, which
makes the Abouzied technique unacceptable for DNA
hybridization assays.

Porous membranes with printed patterns of
hydrophilic/hydrophobic regions exist for applications
such as ordered arrays of bacteria colonies. QA Life
Sciences (San Diego CA) makes such a membrane with a
grid pattern printed on it. However, this membrane has
the same disadvantage as the Abouzied technique since
reagents can still flow between the gridded arrays,
making them unusable for separate DNA hybridization
assays.

Pall Corporation makes a 96-well plate with a
porous filter heat sealed to the bottom of the plate.
These plates are capable of containing different




reagents in each well without cross-contamination.
However, each well is intended to hold only one target
element, whereas the invention described here makes a
microarray of many biomolecules in each subdivided
region of the solid support. Furthermore, the 96 well
plates are at least 1 cm thick and prevent the use of
the device for many colorimetric, fluorescent and
radioactive detection formats which require that the
membrane lie flat against the detection surface. The
invention described here requires no further processing
after the assay step since the barrier elements are
shallow and do not interfere with the detection step,
thereby greatly increasing convenience.

Hyseq Corporation has described a method of making
an "array of arrays" on a non-porous solid support for
use with their sequencing by hybridization technique.
The method described by Hyseq involves modifying the
chemistry of the solid support material to form a
hydrophobic grid pattern where each subdivided region
contains a microarray of biomolecules. Hyseq's flat
hydrophobic pattern does not make use of physical
blocking as an additional means of preventing cross
contamination.

Summary of the Invention

The invention includes, in one aspect, a method of
forming a microarray of analyte-assay regions on a
solid support, where each region in the array has a
known amount of a selected, analyte-specific reagent.
The method involves first loading a solution of a
selected analyte-specific reagent in a reagent-
dispensing device having an elongate capillary channel
(i) formed by spaced-apart, coextensive elongate
members, (ii) adapted to hold a quantity of the reagent
solution and (iii) having a tip region at which aqueous




solution in the channel forms a meniscus. The channel 
is preferably formed by a pair of spaced-apart tapered 
elements. 

The tip of the dispensing device is tapped against
a solid support at a defined position on the support
surface with an impulse effective to break the meniscus
in the capillary channel and deposit a selected volume of
solution on the surface, preferably a selected volume
in the range 0.01 to 100 nl. The two steps are
repeated until the desired array is formed.

The method may be practiced in forming a plurality
of such arrays, where the solution-depositing step is
applied to a selected position on each of a
plurality of solid supports at each repeat cycle.

The dispensing device may be loaded with a new
solution, by the steps of (i) dipping the capillary
channel of the device in a wash solution, (ii) removing
wash solution drawn into the capillary channel, and
(iii) dipping the capillary channel into the new
reagent solution.

Also included in the invention is an automated
apparatus for forming a microarray of analyte-assay
regions on a plurality of solid supports, where each
region in the array has a known amount of a selected,
analyte-specific reagent. The apparatus has a holder
for holding, at known positions, a plurality of planar
supports, and a reagent dispensing device of the type
described above.

The apparatus further includes positioning
structure for positioning the dispensing device at a
selected array position with respect to a support in
said holder, and dispensing structure for moving the
dispensing device into tapping engagement against a
support with a selected impulse effective to deposit a




selected volume on the support, e.g., a selected volume 
in the volume range 0.01 to 100 nl. 

The positioning and dispensing structures are
controlled by a control unit in the apparatus. The
unit operates to (i) place the dispensing device at a
loading station, (ii) move the capillary channel in the
device into a selected reagent at the loading station,
to load the dispensing device with the reagent, and
(iii) dispense the reagent at a defined array position
on each of the supports on said holder. The unit may
further operate, at the end of a dispensing cycle, to
wash the dispensing device by (i) placing the
dispensing device at a washing station, (ii) moving the
capillary channel in the device into a wash fluid, to
load the dispensing device with the fluid, and (iii)
removing the wash fluid prior to loading the dispensing
device with a fresh selected reagent.

The dispensing device in the apparatus may be one
of a plurality of such devices which are carried on the
arm for dispensing different analyte assay reagents at
selected spaced array positions.

In another aspect, the invention includes a
substrate with a surface having a microarray of at
least 10^3 distinct polynucleotide or polypeptide
biopolymers in a surface area of less than about 1 cm^2.
Each distinct biopolymer (i) is disposed at a separate,
defined position in said array, (ii) has a length of at
least 50 subunits, and (iii) is present in a defined
amount between about 0.1 femtomoles and 100 nanomoles.

In one embodiment, the surface is a glass slide
surface coated with a polycationic polymer, such as
polylysine, and the biopolymers are polynucleotides.
In another embodiment, the substrate has a water-
impermeable backing, a water-permeable film formed on



the backing, and a grid formed on the film. The grid
is composed of intersecting water-impervious grid
elements extending from said backing to positions
raised above the surface of said film, and partitions
the film into a plurality of water-impervious cells. A
biopolymer array is formed within each cell.

More generally, there is provided a substrate for
use in detecting binding of labeled polynucleotides to
one or more of a plurality of different-sequence,
immobilized polynucleotides. The substrate includes,
in one aspect, a glass support, a coating of a
polycationic polymer, such as polylysine, on said
surface of the support, and an array of distinct
polynucleotides electrostatically bound non-covalently
to said coating, where each distinct biopolymer is
disposed at a separate, defined position in a surface
array of polynucleotides.

In another aspect, the substrate includes a water-
impermeable backing, a water-permeable film formed on
the backing, and a grid formed on the film, where the
grid is composed of intersecting water-impervious grid
elements extending from the backing to positions raised
above the surface of the film, forming a plurality of
cells. A biopolymer array is formed within each cell.

Also forming part of the invention is a method of
detecting differential expression of each of a
plurality of genes in a first cell type, with respect
to expression of the same genes in a second cell type.
In practicing the method, there are first produced
fluorescent-labeled cDNA's from mRNA's isolated from
the two cell types, where the cDNA's from the first
and second cells are labeled with first and second
different fluorescent reporters.

A mixture of the labeled cDNA's from the two cell
types is added to an array of polynucleotides




representing a plurality of known genes derived from
the two cell types, under conditions that result in
hybridization of the cDNA's to complementary-sequence
polynucleotides in the array. The array is then
examined by fluorescence under fluorescence excitation
conditions in which (i) polynucleotides in the array
that are hybridized predominantly to cDNA's derived
from one of the first and second cell types give a
distinct first or second fluorescence emission color,
respectively, and (ii) polynucleotides in the array
that are hybridized to substantially equal numbers of
cDNA's derived from the first and second cell types
give a distinct combined fluorescence emission color.
The relative expression of known genes
in the two cell types can then be determined by the
observed fluorescence emission color of each spot.
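
The three-way color readout described above can be illustrated with a minimal sketch. It assumes each spot has already been reduced to two background-corrected intensities, one per fluorescent reporter. The function name `classify_spot` and the two-fold ratio threshold are hypothetical choices made for illustration only and are not taken from this disclosure.

```python
def classify_spot(first: float, second: float, ratio_threshold: float = 2.0) -> str:
    """Label a spot by which reporter dominates its fluorescence.

    first, second: intensities of the first and second reporters (assumed
    background-corrected). ratio_threshold is a hypothetical cut-off for
    "predominantly" one cell type.
    """
    if first <= 0 and second <= 0:
        return "no signal"
    if first >= ratio_threshold * second:
        return "first"      # hybridized predominantly to first-cell-type cDNA
    if second >= ratio_threshold * first:
        return "second"     # hybridized predominantly to second-cell-type cDNA
    return "combined"       # roughly equal amounts of both cDNA populations


spots = {"gene A": (100.0, 10.0), "gene B": (10.0, 100.0), "gene C": (100.0, 90.0)}
calls = {name: classify_spot(f, s) for name, (f, s) in spots.items()}
```

A fixed ratio is the simplest possible rule; any real threshold would have to be set from control spots on the same array.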

These and other objects and features of the
invention will become more fully apparent when the
following detailed description of the invention is read
in conjunction with the accompanying figures.

Brief Description of the Drawings

Fig. 1 is a side view of a reagent-dispensing
device having an open-capillary dispensing head
constructed for use in one embodiment of the invention;

Figs. 2A-2C illustrate steps in the delivery of a
fixed-volume bead on a hydrophobic surface employing
the dispensing head from Fig. 1, in accordance with one
embodiment of the method of the invention;

Fig. 3 shows a portion of a two-dimensional array
of analyte-assay regions constructed according to the
method of the invention;

Fig. 4 is a planar view showing components of an
automated apparatus for forming arrays in accordance
with the invention;




Fig. 5 shows a fluorescent image of an actual 20 x
20 array of 400 fluorescently-labeled DNA samples
immobilized on a poly-l-lysine coated slide, where the
total area covered by the 400 element array is 16
square millimeters;

Fig. 6 is a fluorescent image of a 1.8 cm x 1.8 cm
microarray containing lambda clones with yeast inserts,
the fluorescent signal arising from the hybridization
to the array with approximately half the yeast genome
labeled with a green fluorophore and the other half
with a red fluorophore;

Fig. 7 shows the translation of the hybridization
image of Fig. 6 into a karyotype of the yeast genome,
where the elements of the Fig. 6 microarray contain yeast
DNA sequences that have been previously physically
mapped in the yeast genome;

Fig. 8 shows a fluorescent image of a 0.5 cm x 0.5
cm microarray of 24 cDNA clones, where the microarray
was hybridized simultaneously with total cDNA from a wild
type Arabidopsis plant labeled with a green fluorophore
and total cDNA from a transgenic Arabidopsis plant
labeled with a red fluorophore, and the arrow points to
the cDNA clone representing the gene introduced into
the transgenic Arabidopsis plant;

Fig. 9 shows a plan view of a substrate having an
array of cells formed by barrier elements in the form
of a grid;

Fig. 10 shows an enlarged plan view of one of the
cells in the substrate in Fig. 9, showing an array of
polynucleotide regions in the cell;

Fig. 11 is an enlarged sectional view of the
substrate in Fig. 9, taken along a section line in that
figure; and

Fig. 12 is a scanned image of a 3 cm x 3 cm
nitrocellulose solid support containing four identical




arrays of M13 clones in each of four quadrants, where 
each quadrant was hybridized simultaneously to a 
different oligonucleotide using an open face 
hybridization method. 


Detailed Description of the Invention 

I. Definitions

Unless indicated otherwise, the terms defined
below have the following meanings:

"Ligand" refers to one member of a ligand/anti-
ligand binding pair. The ligand may be, for example,
one of the nucleic acid strands in a complementary,
hybridized nucleic acid duplex binding pair; an
effector molecule in an effector/receptor binding pair;
or an antigen in an antigen/antibody or
antigen/antibody fragment binding pair.

"Antiligand" refers to the opposite member of a
ligand/anti-ligand binding pair. The antiligand may be
the other of the nucleic acid strands in a
complementary, hybridized nucleic acid duplex binding
pair; the receptor molecule in an effector/receptor
binding pair; or an antibody or antibody fragment
molecule in an antigen/antibody or antigen/antibody
fragment binding pair, respectively.

"Analyte" or "analyte molecule" refers to a
molecule, typically a macromolecule, such as a
polynucleotide or polypeptide, whose presence, amount,
and/or identity are to be determined. The analyte is
one member of a ligand/anti-ligand pair.

"Analyte-specific assay reagent" refers to a
molecule effective to bind specifically to an analyte
molecule. The reagent is the opposite member of a
ligand/anti-ligand binding pair.

An "array of regions on a solid support" is a
linear or two-dimensional array of preferably discrete




regions, each having a finite area, formed on the
surface of a solid support.

A "microarray" is an array of regions having a
density of discrete regions of at least about 100/cm^2,
and preferably at least about 1000/cm^2. The regions in
a microarray have typical dimensions, e.g., diameters,
in the range of between about 10-250 µm, and are
separated from other regions in the array by about the
same distance.
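
As a worked illustration of the density criterion just defined, the following sketch compares two arrays mentioned elsewhere in this document: the staggered pin array of 9216 spots in a 22 x 22 cm area, and the 400-element array of Fig. 5 covering 16 square millimeters. The function names are hypothetical and are introduced only for this example.

```python
def region_density(n_regions: int, area_cm2: float) -> float:
    """Return the number of discrete regions per square centimetre."""
    if area_cm2 <= 0:
        raise ValueError("area must be positive")
    return n_regions / area_cm2


def is_microarray(n_regions: int, area_cm2: float, min_density: float = 100.0) -> bool:
    """Apply the 'at least about 100/cm^2' criterion (1000/cm^2 is preferred)."""
    return region_density(n_regions, area_cm2) >= min_density


pin_array = region_density(9216, 22 * 22)   # staggered pin array on a 22 x 22 cm membrane
slide_array = region_density(400, 0.16)     # Fig. 5 array: 16 mm^2 = 0.16 cm^2
```

Under these figures the pin array falls well below the microarray threshold, while the Fig. 5 array exceeds even the preferred density.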

A support surface is "hydrophobic" if an aqueous-
medium droplet applied to the surface does not spread
out substantially beyond the area size of the applied
droplet. That is, the surface acts to prevent
spreading of the droplet applied to the surface by
hydrophobic interaction with the droplet.

A "meniscus" means a concave or convex surface
that forms on the bottom of a liquid in a channel as a
result of the surface tension of the liquid.

"Distinct biopolymers", as applied to the
biopolymers forming a microarray, means an array member
which is distinct from other array members on the basis
of a different biopolymer sequence, and/or different
concentrations of the same or distinct biopolymers,
and/or different mixtures of distinct or different-
concentration biopolymers. Thus an array of "distinct
polynucleotides" means an array containing, as its
members, (i) distinct polynucleotides, which may have a
defined amount in each member, (ii) different, graded
concentrations of given-sequence polynucleotides,
and/or (iii) different-composition mixtures of two or
more distinct polynucleotides.

"Cell type" means a cell from a given source, 
e.g., a tissue, or organ, or a cell in a given state of 




differentiation, or a cell associated with a given 
pathology or genetic makeup. 

II. Method of Microarray Formation

This section describes a method of forming a
microarray of analyte-assay regions on a solid support
or substrate, where each region in the array has a
known amount of a selected, analyte-specific reagent.

Fig. 1 illustrates, in a partially schematic view,
a reagent-dispensing device 10 useful in practicing the
method. The device generally includes a reagent
dispenser 12 having an elongate open capillary channel
14 adapted to hold a quantity of the reagent solution,
such as indicated at 16, as will be described below.
The capillary channel is formed by a pair of spaced-
apart, coextensive, elongate members 12a, 12b which are
tapered toward one another and converge at a tip or tip
region 18 at the lower end of the channel. More
generally, the open channel is formed by at least two
elongate, spaced-apart members adapted to hold a
quantity of reagent solution and having a tip region
at which aqueous solution in the channel forms a
meniscus, such as the concave meniscus illustrated at
20 in Fig. 2A. The advantages of the open channel
construction of the dispenser are discussed below.

With continued reference to Fig. 1, the dispenser
device also includes structure for moving the dispenser
rapidly toward and away from a support surface, for
effecting deposition of a known amount of solution in
the dispenser on a support, as will be described below
with reference to Figs. 2A-2C. In the embodiment
shown, this structure includes a solenoid 22 which is
activatable to draw a solenoid piston 24 rapidly
downwardly, then release the piston, e.g., under spring
bias, to a normal, raised position, as shown. The




dispenser is carried on the piston by a connecting
member 26, as shown. The just-described moving
structure is also referred to herein as dispensing
means for moving the dispenser into engagement with a
solid support, for dispensing a known volume of fluid
on the support.

The dispensing device just described is carried on
an arm 28 that may be moved either linearly or in an x-
y plane to position the dispenser at a selected
deposition position, as will be described.

Figs. 2A-2C illustrate the method of depositing a
known amount of reagent solution in the just-described
dispenser on the surface of a solid support, such as
the support indicated at 30. The support is a polymer,
glass, or other solid-material support having a surface
indicated at 31.

In one general embodiment, the surface is a
relatively hydrophilic, i.e., wettable surface, such as
a surface having native, bound or covalently attached
charged groups. One such surface described below is a
glass surface having an absorbed layer of a
polycationic polymer, such as poly-l-lysine.

In another embodiment, the surface has or is
formed to have a relatively hydrophobic character,
i.e., one that causes aqueous medium deposited on the
surface to bead. A variety of known hydrophobic
polymers, such as polystyrene, polypropylene, or
polyethylene have desired hydrophobic properties, as do
glass and a variety of lubricant or other hydrophobic
films that may be applied to the support surface.

Initially, the dispenser is loaded with a selected
analyte-specific reagent solution, such as by dipping
the dispenser tip, after washing, into a solution of
the reagent, and allowing filling by capillary flow
into the dispenser channel. The dispenser is now moved




to a selected position with respect to a support
surface, placing the dispenser tip directly above the
support-surface position at which the reagent is to be
deposited. This movement takes place with the
dispenser tip in its raised position, as seen in Fig.
2A, where the tip is typically at least several (1-5) mm
above the surface of the substrate.

With the dispenser so positioned, solenoid 22 is
now activated to cause the dispenser tip to move
rapidly toward and away from the substrate surface,
making momentary contact with the surface, in effect,
tapping the tip of the dispenser against the support
surface. The tapping movement of the tip against the
surface acts to break the liquid meniscus in the tip
channel, bringing the liquid in the tip into contact
with the support surface. This, in turn, produces a
flowing of the liquid into the capillary space between
the tip and the surface, acting to draw liquid out of
the dispenser channel, as seen in Fig. 2B.

Fig. 2C shows flow of fluid from the tip onto the
support surface, which in this case is a hydrophobic
surface. The figure illustrates that liquid continues
to flow from the dispenser onto the support surface
until it forms a liquid bead 32. At a given bead size,
i.e., volume, the tendency of liquid to flow onto the
surface will be balanced by the hydrophobic surface
interaction of the bead with the support surface, which
acts to limit the total bead area on the surface, and
by the surface tension of the droplet, which tends
toward a given bead curvature. At this point, a given
bead volume will have formed, and continued contact of
the dispenser tip with the bead, as the dispenser tip
is being withdrawn, will have little or no effect on
bead volume.




For liquid-dispensing on a more hydrophilic
surface, the liquid will have less of a tendency to
bead, and the dispensed volume will be more sensitive
to the total dwell time of the dispenser tip in the
immediate vicinity of the support surface, e.g., the
positions illustrated in Figs. 2B and 2C.

The desired deposition volume, i.e., bead volume,
formed by this method is preferably in the range 2 pl
(picoliters) to 2 nl (nanoliters), although volumes as
high as 100 nl or more may be dispensed. It will be
appreciated that the selected dispensed volume will
depend on (i) the "footprint" of the dispenser tip,
i.e., the size of the area spanned by the tip, (ii) the
hydrophobicity of the support surface, and (iii) the
time of contact with and rate of withdrawal of the tip
from the support surface. In addition, bead size may
be reduced by increasing the viscosity of the medium,
effectively reducing the flow time of liquid from the
dispenser onto the support surface. The drop size may
be further constrained by depositing the drop in a
hydrophilic region surrounded by a hydrophobic grid
pattern on the support surface.

In a typical embodiment, the dispenser tip is
tapped rapidly against the support surface, with a
total residence time in contact with the support of
less than about 1 msec, and a rate of upward travel
from the surface of about 10 cm/sec.

Assuming that the bead that forms on contact with
the surface is a hemispherical bead, with a diameter
approximately equal to the width of the dispenser tip,
as shown in Fig. 2C, the volume of the bead formed in
relation to dispenser tip width (d) is given in Table 1
below. As seen, the volume of the bead ranges between
2 pl and 2 nl as the width size is increased from about
20 to 200 µm.




Table 1

d            Volume (nl)
20 µm        2 x 10^-3
50 µm        3.1 x 10^-2
100 µm       2.5 x 10^-1
200 µm       2
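
The hemispherical-bead assumption stated above can be checked with a short calculation. The sketch below is illustrative only; the function name is hypothetical. It uses the volume of a hemisphere, V = (2/3)·π·(d/2)^3, and the conversion 1 nl = 10^6 cubic micrometers.

```python
import math


def hemisphere_volume_nl(d_um: float) -> float:
    """Volume in nanoliters of a hemispherical bead of diameter d_um (micrometers)."""
    radius_um = d_um / 2.0
    volume_um3 = (2.0 / 3.0) * math.pi * radius_um ** 3
    return volume_um3 * 1e-6  # 1 nl = 1e6 cubic micrometers


# Tip widths listed in Table 1, in micrometers.
volumes = {d: hemisphere_volume_nl(d) for d in (20, 50, 100, 200)}
```

The computed values agree with Table 1 to within the rounding used there: the 20 µm and 200 µm entries come out at roughly 2 pl and 2 nl, respectively.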



At a given tip size, bead volume can be reduced in
a controlled fashion by increasing surface
hydrophobicity, reducing time of contact of the tip
with the surface, increasing rate of movement of the
tip away from the surface, and/or increasing the
viscosity of the medium. Once these parameters are
fixed, a selected deposition volume in the desired pl
to nl range can be achieved in a repeatable fashion.

After depositing a bead at one selected location
on a support, the tip is typically moved to a
corresponding position on a second support, a droplet
is deposited at that position, and this process is
repeated until a liquid droplet of the reagent has been
deposited at a selected position on each of a plurality
of supports.

The tip is then washed to remove the reagent
liquid, filled with another reagent liquid, and this
reagent is now deposited at another array position
on each of the supports. In one embodiment, the tip is
washed and refilled by the steps of (i) dipping the
capillary channel of the device in a wash solution,
(ii) removing wash solution drawn into the capillary
channel, and (iii) dipping the capillary channel into
the new reagent solution.
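
The load, deposit and wash cycle just described can be summarized as an ordered list of operations. The following is a minimal sketch, assuming one array position per reagent and a single dispenser. The function name and the tuple layout are hypothetical and are not part of the disclosed apparatus.

```python
def print_schedule(reagents, n_supports):
    """Return the ordered operations for printing one spot per reagent on every support.

    Each reagent is loaded once, tapped at the same array position on each
    support in turn, and then washed out before the next reagent is loaded.
    """
    steps = []
    for position, reagent in enumerate(reagents):
        steps.append(("load", reagent))
        for support in range(n_supports):
            steps.append(("tap", support, position))
        steps.append(("wash", reagent))
    return steps


schedule = print_schedule(["reagent 1", "reagent 2"], n_supports=3)
```

With this ordering the wash and reload overhead is paid once per reagent rather than once per spot, which is why the same position is printed on all supports before the tip is changed over.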

From the foregoing, it will be appreciated that
the tweezers-like, open-capillary dispenser tip



« 1 

wo 95/35505 



PCTAJS95/07659 



18 

provides the advantages that (i) the open channel of 
the tip facilitates rapid ^ efficient washing and drying 
before reloading the tip with a new reagent, (ii) 
passive capillary action can load the sample directly 
5 from a standard microwell plate while retaining 

sufficient sample in the open capillary reservoir for 
the printing of numerous arrays, (iii) open capillaries 
are less prone to clogging than closed capillaries, and 
(iv) open capillaries do not require a perfectly faced 

10 bottom surface for fluid delivery. 

A portion of a microarray 36 formed on the surface
38 of a solid support 40 in accordance with the method
just described is shown in Fig. 3. The array is formed
of a plurality of analyte-specific reagent regions,
such as regions 42, where each region may include a
different analyte-specific reagent. As indicated
above, the diameter of each region is preferably
between about 20-200 µm. The spacing between each
region and its closest (non-diagonal) neighbor,
measured from center-to-center (indicated at 44), is
preferably in the range of about 20-400 µm. Thus, for
example, an array having a center-to-center spacing of
about 250 µm contains about 40 regions/cm or 1,600
regions/cm^2. After formation of the array, the support
is treated to evaporate the liquid of the droplet
forming each region, to leave a desired array of dried,
relatively flat regions. This drying may be done by
heating or under vacuum.
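
The density figures quoted above follow directly from the center-to-center spacing on a square grid. The following sketch, with hypothetical function names, reproduces that arithmetic; it assumes a regular square lattice and ignores edge effects.

```python
def regions_per_cm(spacing_um: float) -> float:
    """Regions along 1 cm of a row for a given center-to-center spacing (µm)."""
    return 10_000.0 / spacing_um  # 1 cm = 10,000 micrometers


def regions_per_cm2(spacing_um: float) -> float:
    """Regions per square centimeter, assuming a regular square grid."""
    return regions_per_cm(spacing_um) ** 2


if __name__ == "__main__":
    # 250 µm is the worked example in the text; 400 µm is the upper end
    # of the preferred spacing range.
    for spacing in (250, 400):
        print(spacing, regions_per_cm(spacing), regions_per_cm2(spacing))
```

For the 250 µm example this gives 40 regions/cm and 1,600 regions/cm^2, matching the values stated in the text.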

In some cases, it is desired to first rehydrate
the droplets containing the analyte reagents to allow
for more time for adsorption to the solid support. It
is also possible to spot out the analyte reagents in a
humid environment so that droplets do not dry until the
arraying operation is complete.




III. Automated Apparatus for Forming Arrays

In another aspect, the invention includes an
automated apparatus for forming an array of
analyte-assay regions on a solid support, where each
region in the array has a known amount of a selected,
analyte-specific reagent.

The apparatus is shown in planar, and partially
schematic, view in Fig. 4. A dispenser device 72 in the
apparatus has the basic construction described above
with respect to Fig. 1, and includes a dispenser 74
having an open-capillary channel terminating at a tip,
substantially as shown in Figs. 1 and 2A-2C.

The dispenser is mounted in the device for
movement toward and away from a dispensing position at
which the tip of the dispenser taps a support surface,
to dispense a selected volume of reagent solution, as
described above. This movement is effected by a
solenoid 76 as described above. Solenoid 76 is under
the control of a control unit 77 whose operation will
be described below. The solenoid is also referred to
herein as dispensing means for moving the device into
tapping engagement with a support, when the device is
positioned at a defined array position with respect to
that support.

The dispenser device is carried on an arm 74 which
is threadedly mounted on a worm screw 80 driven
(rotated) in a desired direction by a stepper motor 82
also under the control of unit 77. At its left end in
the figure, screw 80 is carried in a sleeve 84 for
rotation about the screw axis. At its other end, the
screw is mounted to the drive shaft of the stepper
motor, which in turn is carried on a sleeve 86. The
dispenser device, worm screw, the two sleeves mounting
the worm screw, and the stepper motor used in moving
the device in the "x" (horizontal) direction in the
figure form what is referred to here collectively as a
displacement assembly 86.

The displacement assembly is constructed to
produce precise, micro-range movement in the direction
of the screw, i.e., along an x axis in the figure. In
one mode, the assembly functions to move the dispenser
in x-axis increments having a selected distance in the
range 5-25 µm. In another mode, the dispenser unit may
be moved in precise x-axis increments of several
microns or more, for positioning the dispenser at
associated positions on adjacent supports, as will be
described below.

The displacement assembly, in turn, is mounted for
movement in the "y" (vertical) axis of the figure, for
positioning the dispenser at a selected y axis
position. The structure mounting the assembly includes
a fixed rod 88 mounted rigidly between a pair of frame
bars 90, 92, and a worm screw 94 mounted for rotation
between a pair of frame bars 96, 98. The worm screw is
driven (rotated) by a stepper motor 100 which operates
under the control of unit 77. The motor is mounted on
bar 96, as shown.

The structure just described, including worm screw
94 and motor 100, is constructed to produce precise,
micro-range movement in the direction of the screw,
i.e., along a y axis in the figure. As above, the
structure functions in one mode to move the dispenser
in y-axis increments having a selected distance in the
range 5-250 µm, and in a second mode, to move the
dispenser in precise y-axis increments of several
microns (µm) or more, for positioning the dispenser at
associated positions on adjacent supports.

The displacement assembly and structure for moving
this assembly in the y axis are referred to herein
collectively as positioning means for positioning the
dispensing device at a selected array position with
respect to a support.

A holder 102 in the apparatus functions to hold a
plurality of supports, such as supports 104, on which
the microarrays of reagent regions are to be formed by
the apparatus. The holder provides a number of
recessed slots, such as slot 106, which receive the
supports, and position them at precise selected
positions with respect to the frame bars on which the
dispenser moving means is mounted.

As noted above, the control unit in the device
functions to actuate the two stepper motors and
dispenser solenoid in a sequence designed for automated
operation of the apparatus in forming a selected
microarray of reagent regions on each of a plurality of
supports.

The control unit is constructed, according to
conventional microprocessor control principles, to
provide appropriate signals to the solenoid and to
each of the stepper motors, in a given timed sequence
and for an appropriate signalling time. The construction
of the unit, and the settings that are selected by the
user to achieve a desired array pattern, will be
understood from the following description of a typical
apparatus operation.

Initially, one or more supports are placed in one
or more slots in the holder. The dispenser is then
moved to a position directly above a well (not shown)
containing a solution of the first reagent to be
dispensed on the support(s). The dispenser solenoid is
actuated now to lower the dispenser tip into this well,
causing the capillary channel in the dispenser to fill.
Motors 82, 100 are now actuated to position the
dispenser at a selected array position at the first of
the supports. Solenoid actuation of the dispenser is
then effective to dispense a selected-volume droplet of
that reagent at this location. As noted above, this
operation is effective to dispense a selected volume
preferably between 2 pl and 2 nl of the reagent
solution.

The dispenser is now moved to the corresponding
position at an adjacent support and a similar volume of
the solution is dispensed at this position. The
process is repeated until the reagent has been
dispensed at this preselected corresponding position on
each of the supports.

Where it is desired to dispense a single reagent
at more than two array positions on a support, the
dispenser may be moved to different array positions at
each support, before moving the dispenser to a new
support, or the solution can be dispensed at one
selected position on each support, with the cycle then
repeated for each new array position.
To dispense the next reagent, the dispenser is
positioned over a wash solution (not shown), and the
dispenser tip is dipped in and out of this solution
until the reagent solution has been substantially
washed from the tip. Solution can be removed from the
tip, after each dipping, by vacuum, compressed air
spray, sponge, or the like.

The dispenser tip is now dipped in a second
reagent well, and the filled tip is moved to a second
selected array position on the first support. The
process of dispensing reagent at each of the
corresponding second array positions is then carried
out as above. This process is repeated until an entire
microarray of reagent solutions on each of the supports
has been formed.
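
The operating sequence described in this section (fill, position, tap on each support, wash, then repeat with the next reagent) can be summarized as an ordered list of actions. The sketch below is a schematic illustration only, not the control unit's actual program: the function build_spotting_plan and the action names are hypothetical, and it assumes the simplest case in which reagent i is deposited at array position i on every support.

```python
from typing import List, Tuple


def build_spotting_plan(reagents: List[str], n_supports: int) -> List[Tuple]:
    """Return the ordered actions for depositing one reagent per array position.

    Assumption for illustration: reagent i goes to array position i on
    every support before the tip is washed and refilled.
    """
    plan: List[Tuple] = []
    for position, reagent in enumerate(reagents):
        if position > 0:
            # Dip the tip in wash solution and remove liquid before reloading.
            plan.append(("wash_tip",))
        # Lower the tip into the reagent well so the capillary channel fills.
        plan.append(("fill_tip", reagent))
        for support in range(n_supports):
            # Stepper motors bring the tip to the same array position on each support.
            plan.append(("move_xy", support, position))
            # Solenoid taps the tip against the support to deposit a droplet.
            plan.append(("tap", support, position))
    return plan


if __name__ == "__main__":
    for action in build_spotting_plan(["reagent_1", "reagent_2"], n_supports=3):
        print(action)
```

Ordering the loop by reagent first and support second mirrors the text: each fill of the capillary is used across all supports before the tip is washed, so the number of wash cycles depends on the number of reagents rather than on the number of supports.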

IV. Microarray Substrate




This section describes embodiments of a substrate
having a microarray of biological polymers carried on
the substrate surface. Subsection A describes a
multi-cell substrate, each cell of which contains a
microarray, and preferably an identical microarray, of
distinct biopolymers, such as distinct polynucleotides,
formed on a porous surface. Subsection B describes a
microarray of distinct polynucleotides bound on a glass
slide coated with a polycationic polymer.

A. Multi-Cell Substrate 

Fig. 9 illustrates, in plan view, a substrate 110
constructed according to the invention. The substrate
has an 8 x 12 rectangular array 112 of cells, such as
cells 114, 116, formed on the substrate surface. With
reference to Fig. 10, each cell, such as cell 114, in
turn supports a microarray 118 of distinct biopolymers,
such as polypeptides or polynucleotides at known,
addressable regions of the microarray. Two such
regions forming the microarray are indicated at 120,
and correspond to regions, such as regions 42, forming
the microarray of distinct biopolymers shown in Fig. 3.

The 96-cell array shown in Fig. 9 typically has
array dimensions between about 12 and 244 mm in width
and 8 and 400 mm in length, with the cells in the array
having width and length dimensions of 1/12 and 1/8 the
array width and length dimensions, respectively, i.e.,
between about 1 and 20 mm in width and 1 and 50 mm in
length.

The construction of the substrate is shown
cross-sectionally in Fig. 11, which is an enlarged
sectional view taken along view line 124 in Fig. 9. The
substrate includes a water-impermeable backing 126,
such as a glass slide or rigid polymer sheet. Formed
on the surface of the backing is a water-permeable film
128. The film is formed of a porous membrane material,
such as nitrocellulose membrane, or a porous web
material, such as a nylon, polypropylene, or PVDF
porous polymer material. The thickness of the film is
preferably between about 10 and 1000 µm. The film may
be applied to the backing by spraying or coating
uncured material on the backing, or by applying a
preformed membrane to the backing. The backing and
film may be obtained as a preformed unit from a
commercial source, e.g., a plastic-backed
nitrocellulose film available from Schleicher and
Schuell Corporation.

With continued reference to Fig. 11, the
film-covered surface in the substrate is partitioned
into a desired array of cells by water-impermeable grid
lines, such as lines 130, 132, which have infiltrated
the film down to the level of the backing, and extend
above the surface of the film as shown, typically a
distance of 100 to 2000 µm above the film surface.

The grid lines are formed on the substrate by
laying down an uncured or otherwise flowable resin or
elastomer solution in an array grid, allowing the
material to infiltrate the porous film down to the
backing, then curing or otherwise hardening the grid
lines to form the cell-array substrate.

One preferred material for the grid is a flowable
silicone available from Loctite Corporation. The
barrier material can be extruded through a narrow
syringe (e.g., 22 gauge) using air pressure or
mechanical pressure. The syringe is moved relative to
the solid support, to print the barrier elements as a
grid pattern. The extruded bead of silicone wicks into
the pores of the solid support and cures to form a
shallow waterproof barrier separating the regions of
the solid support.




In alternative embodiments, the barrier element
can be a wax-based material or a thermoset material
such as epoxy. The barrier material can also be a
UV-curing polymer which is exposed to UV light after
being printed onto the solid support. The barrier
material may also be applied to the solid support using
printing techniques such as silk-screen printing. The
barrier material may also be a heat-seal stamping of
the porous solid support which seals its pores and
forms a water-impervious barrier element. The barrier
material may also be a shallow grid which is laminated
or otherwise adhered to the solid support.

In addition to plastic-backed nitrocellulose, the
solid support can be virtually any porous membrane with
or without a non-porous backing. Such membranes are
readily available from numerous vendors and are made
from nylon, PVDF, polysulfone and the like. In an
alternative embodiment, the barrier element may also be
used to adhere the porous membrane to a non-porous
backing in addition to functioning as a barrier to
prevent cross contamination of the assay reagents.

In an alternative embodiment, the solid support
can be of a non-porous material. The barrier can be
printed either before or after the microarray of
biomolecules is printed on the solid support.

As can be appreciated, the cells formed by the
grid lines and the underlying backing are
water-impermeable, having side barriers projecting
above the porous film in the cells. Thus,
defined-volume samples can be placed in each well
without risk of cross-contamination with sample
material in adjacent cells. In Fig. 11, defined-volume
samples, such as sample 134, are shown in the cells.

As noted above, each well contains a microarray of
distinct biopolymers. In one general embodiment, the
microarrays in the wells are identical arrays of
distinct biopolymers, e.g., different sequence
polynucleotides. Such arrays can be formed in
accordance with the methods described in Section II, by
depositing a first selected polynucleotide at the same
selected microarray position in each of the cells, then
depositing a second polynucleotide at a different
microarray position in each well, and so on until a
complete, identical microarray is formed in each cell.

In a preferred embodiment, each microarray
contains about 10^3 distinct polynucleotide or
polypeptide biopolymers per surface area of less than
about 1 cm^2. Also in a preferred embodiment, the
biopolymers in each microarray region are present in a
defined amount between about 0.1 femtomoles and 100
nanomoles. The ability to form high-density arrays of
biopolymers, where each region is formed of a
well-defined amount of deposited material, can be
achieved in accordance with the microarray-forming
method described in Section II.

Also in a preferred embodiment, the biopolymers
are polynucleotides having lengths of at least about 50
bp, i.e., substantially longer than oligonucleotides
which can be formed in high-density arrays by schemes
involving parallel, step-wise polymer synthesis on the
array surface.

In the case of a polynucleotide array, in an assay
procedure, a small volume of the labeled DNA probe
mixture in a standard hybridization solution is loaded
onto each cell. The solution will spread to cover the
entire microarray and stop at the barrier elements.
The solid support is then incubated in a humid chamber
at the appropriate temperature as required by the
assay.




Each assay may be conducted in an "open-face"
format where no further sealing step is required, since
the hybridization solution will be kept properly
hydrated by the water vapor in the humid chamber. At
the conclusion of the incubation step, the entire solid
support containing the numerous microarrays is rinsed
quickly enough to dilute the assay reagents so that no
significant cross contamination occurs. The entire
solid support is then reacted with detection reagents
if needed and analyzed using standard colorimetric,
radioactive or fluorescent detection means. All
processing and detection steps are performed
simultaneously on all of the microarrays on the solid
support, ensuring uniform assay conditions for all of
the microarrays on the solid support.

B. Glass-Slide Polynucleotide Array 
Fig. 5 shows a substrate 136 formed according to
another aspect of the invention, and intended for use
in detecting binding of labeled polynucleotides to one
or more of a plurality of distinct polynucleotides. The
substrate includes a glass substrate 138 having formed
on its surface a coating of a polycationic polymer,
preferably a cationic polypeptide, such as polylysine
or polyarginine. Formed on the polycationic coating is
a microarray 140 of distinct polynucleotides, each
localized at known selected array regions, such as
regions 142.

The slide is coated by placing a uniform-thickness
film of a polycationic polymer, e.g., poly-l-lysine, on
the surface of a slide and drying the film to form a
dried coating. The amount of polycationic polymer
added is sufficient to form at least a monolayer of
polymers on the glass surface. The polymer film is
bound to the surface via electrostatic binding between
negative silyl-OH groups on the surface and charged
amine groups in the polymers. Poly-l-lysine coated
glass slides may be obtained commercially, e.g., from
Sigma Chemical Co. (St. Louis, MO).
To form the microarray, defined volumes of
distinct polynucleotides are deposited on the
polymer-coated slide, as described in Section II.
According to an important feature of the substrate, the
deposited polynucleotides remain bound to the coated
slide surface non-covalently when an aqueous DNA sample
is applied to the substrate under conditions which
allow hybridization of reporter-labeled polynucleotides
in the sample to complementary-sequence
(single-stranded) polynucleotides in the substrate
array. The method is illustrated in Examples 1 and 2.

To illustrate this feature, a substrate of the
type just described, but having an array of
same-sequence polynucleotides, was mixed with
fluorescent-labeled complementary DNA under
hybridization conditions. After washing to remove
non-hybridized material, the substrate was examined by
low-power fluorescence microscopy. The array can be
visualized by the relatively uniform labeling pattern
of the array regions.

In a preferred embodiment, each microarray
contains at least 10^3 distinct polynucleotide or
polypeptide biopolymers per surface area of less than
about 1 cm^2. In the embodiment shown in Fig. 5, the
microarray contains 400 regions in an area of about 16
mm^2, or 2.5 x 10^3 regions/cm^2. Also in a preferred
embodiment, the polynucleotides in each microarray
region are present in a defined amount between about
0.1 femtomoles and 100 nanomoles. As above, the ability
to form high-density arrays of this type, where each
region is formed of a well-defined amount of deposited
material, can be achieved in accordance with the
microarray-forming method described in Section II.
Also in a preferred embodiment, the
polynucleotides have lengths of at least about 50 bp,
i.e., substantially longer than oligonucleotides which
can be formed in high-density arrays by various in situ
synthesis schemes.

V. Utility 

Microarrays of immobilized nucleic acid sequences
prepared in accordance with the invention can be used
for large scale hybridization assays in numerous
genetic applications, including genetic and physical
mapping of genomes, monitoring of gene expression, DNA
sequencing, genetic diagnosis, genotyping of organisms,
and distribution of DNA reagents to researchers.

For gene mapping, a gene or a cloned DNA fragment
is hybridized to an ordered array of DNA fragments, and
the identity of the DNA elements applied to the array
is unambiguously established by the pixel or pattern of
pixels of the array that are detected. One application
of such arrays for creating a genetic map is described
by Nelson, et al. (1993). In constructing physical
maps of the genome, arrays of immobilized cloned DNA
fragments are hybridized with other cloned DNA
fragments to establish whether the cloned fragments in
the probe mixture overlap and are therefore contiguous
to the immobilized clones on the array. For example,
Lehrach, et al., describe such a process.

The arrays of immobilized DNA fragments may also
be used for genetic diagnostics. To illustrate, an
array containing multiple forms of a mutated gene or
genes can be probed with a labeled mixture of a
patient's DNA which will preferentially interact with
only one of the immobilized versions of the gene.

The detection of this interaction can lead to a
medical diagnosis. Arrays of immobilized DNA fragments
can also be used in DNA probe diagnostics. For
example, the identity of a pathogenic microorganism can
be established unambiguously by hybridizing a sample of
the unknown pathogen's DNA to an array containing many
types of known pathogenic DNA. A similar technique can
also be used for unambiguous genotyping of any
organism. Other molecules of genetic interest, such as
cDNAs and RNAs, can be immobilized on the array or
alternatively used as the labeled probe mixture that is
applied to the array.
In one application, an array of cDNA clones
representing genes is hybridized with total cDNA from
an organism to monitor gene expression for research or
diagnostic purposes. Labeling total cDNA from a normal
cell with one color fluorophore and total cDNA from a
diseased cell with another color fluorophore and
simultaneously hybridizing the two cDNA samples to the
same array of cDNA clones allows for differential gene
expression to be measured as the ratio of the two
fluorophore intensities. This two-color experiment can
be used to monitor gene expression in different tissue
types, disease states, response to drugs, or response
to environmental factors. An example of this approach
is illustrated in Example 2, described with respect to
Fig. 8.
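
The two-color measurement described above reduces, for each array element, to a ratio of the two fluorophore intensities. The sketch below illustrates that calculation. The function names, the use of a base-2 logarithm and the small intensity floor that avoids division by zero are conventions assumed for illustration and are not taken from the text.

```python
import math


def intensity_ratio(test: float, reference: float, floor: float = 1.0) -> float:
    """Ratio of test to reference intensity, clamping both to a small floor.

    The floor is an assumed safeguard against zero or negative
    background-subtracted signals; it is not specified in the text.
    """
    return max(test, floor) / max(reference, floor)


def log2_ratio(test: float, reference: float, floor: float = 1.0) -> float:
    """Base-2 logarithm of the ratio, so equal expression maps to 0."""
    return math.log2(intensity_ratio(test, reference, floor))


if __name__ == "__main__":
    # Hypothetical intensities for three array elements: (test, reference).
    spots = {"clone_a": (1000.0, 100.0), "clone_b": (200.0, 200.0), "clone_c": (50.0, 400.0)}
    for name, (test, ref) in spots.items():
        print(name, round(intensity_ratio(test, ref), 3), round(log2_ratio(test, ref), 3))
```

On a logarithmic scale, an element expressed ten-fold higher in the test sample and one expressed ten-fold lower lie at equal distances on either side of zero, which makes over- and under-expression easier to compare.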

By way of example and without implying a
limitation of scope, such a procedure could be used to
simultaneously screen many patients against all known
mutations in a disease gene. This invention could be
used in the form of, for example, 96 identical 0.9 cm x
2.2 cm microarrays fabricated on a single 12 cm x 18 cm
sheet of plastic-backed nitrocellulose where each
microarray could contain, for example, 100 DNA
fragments representing all known mutations of a given
gene. The region of interest from each of the DNA
samples from 96 patients could be amplified, labeled,
and hybridized to the 96 individual arrays with each
assay performed in 100 microliters of hybridization
solution. The approximately 1 mm thick silicone rubber
barrier elements between individual arrays prevent
cross contamination of the patient samples by sealing
the pores of the nitrocellulose and by acting as a
physical barrier between each microarray. The solid
support containing all 96 microarrays assayed with the
96 patient samples is incubated, rinsed, detected and
analyzed as a single sheet of material using standard
radioactive, fluorescent, or colorimetric detection
means (Maniatis, et al., 1989). Previously, such a
procedure would involve the handling, processing and
tracking of 96 separate membranes in 96 separate sealed
chambers. By processing all 96 arrays as a single
sheet of material, significant time and cost savings
are possible.
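
As a plausibility check on the 96-array sheet in this example, the sketch below tests whether a grid of 0.9 cm x 2.2 cm arrays fits on a 12 cm x 18 cm sheet and reports the space left over for barrier elements. The function name and the 12-column by 8-row orientation are assumptions made for illustration; the text does not state how the arrays are oriented. Dimensions are given in millimeters so that the arithmetic stays in integers.

```python
from typing import Optional, Tuple


def grid_leftover_mm(sheet_mm: Tuple[int, int], array_mm: Tuple[int, int],
                     cols: int, rows: int) -> Optional[Tuple[int, int]]:
    """Return leftover (x, y) space in mm if cols x rows arrays fit, else None."""
    used_x = cols * array_mm[0]
    used_y = rows * array_mm[1]
    if used_x > sheet_mm[0] or used_y > sheet_mm[1]:
        return None  # the grid does not fit in this orientation
    return (sheet_mm[0] - used_x, sheet_mm[1] - used_y)


if __name__ == "__main__":
    sheet = (120, 180)  # 12 cm x 18 cm sheet
    array = (9, 22)     # 0.9 cm x 2.2 cm microarray
    print(grid_leftover_mm(sheet, array, cols=12, rows=8))  # assumed orientation
    print(grid_leftover_mm(sheet, array, cols=8, rows=12))  # rotated grid
```

With 12 arrays across the 12 cm side and 8 along the 18 cm side, the arrays occupy 10.8 cm x 17.6 cm, leaving a small margin in each direction for the barrier lines; the rotated grid does not fit.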

The assay format can be reversed, so that the
patient's or organism's DNA is immobilized as the array
elements and each array is hybridized with a different
mutated allele or genetic marker. The gridded solid
support can also be used for parallel non-DNA ELISA
assays. Furthermore, the invention allows for the use
of all standard detection methods without the need to
remove the shallow barrier elements to carry out the
detection step.

In addition to the genetic applications listed
above, arrays of whole cells, peptides, enzymes,
antibodies, antigens, receptors, ligands,
phospholipids, polymers, drug congener preparations or
chemical substances can be fabricated by the means
described in this invention for large scale screening
assays in medical diagnostics, drug discovery,
molecular biology, immunology and toxicology.
The multi-cell substrate aspect of the invention
allows for the rapid and convenient screening of many
DNA probes against many ordered arrays of DNA
fragments. This eliminates the need to handle and
detect many individual arrays for performing mass
screenings for genetic research and diagnostic
applications. Numerous microarrays can be fabricated
on the same solid support and each microarray reacted
with a different DNA probe while the solid support is
processed as a single sheet of material.

The following examples illustrate, but in no way 
are intended to limit, the present invention. 

Example 1 

Genomic-Complexity Hybridization to Micro
DNA Arrays Representing the Yeast
Saccharomyces cerevisiae Genome with
Two-Color Fluorescent Detection

The array elements were randomly amplified PCR
(Bohlander, et al., 1992) products using physically
mapped lambda clones of S. cerevisiae genomic DNA
templates (Riles, et al., 1993). The PCR was performed
directly on the lambda phage lysates resulting in an
amplification of both the 35 kb lambda vector and the
5-15 kb yeast insert sequences in the form of a uniform
distribution of PCR product between 250-1500 base pairs
in length. The PCR product was purified using
Sephadex G50 gel filtration (Pharmacia, Piscataway, NJ)
and concentrated by evaporation to dryness at room
temperature overnight. Each of the 864 amplified
lambda clones was rehydrated in 15 µl of 3 x SSC in
preparation for spotting onto the glass.

The micro arrays were fabricated on microscope
slides which were coated with a layer of poly-l-lysine
(Sigma). The automated apparatus described in Section
III loaded 1 µl of the concentrated lambda clone PCR
product in 3 x SSC directly from 96 well storage plates
into the open capillary printing element and deposited
~5 nl of sample per slide at 380 micron spacing between
spots, on each of 40 slides. The process was repeated
for all 864 samples and 8 control spots. After the
spotting operation was complete, the slides were
rehydrated in a humid chamber for 2 hours, baked in a
dry 80 °C vacuum oven for 2 hours, rinsed to remove
unabsorbed DNA and then treated with succinic anhydride
to reduce non-specific adsorption of the labeled
hybridization probe to the poly-l-lysine coated glass
surface. Immediately prior to use, the immobilized DNA
on the array was denatured in distilled water at 90 °C
for 2 minutes.
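
The quantities given for the spotting run can be combined into a simple budget. The sketch below is illustrative only: the function name is hypothetical, the ~5 nl deposit is treated as exactly 5 nl, and it assumes the 8 control spots are printed on every slide in the same way as the 864 samples.

```python
def spotting_budget(loaded_nl: float, per_spot_nl: float, n_slides: int,
                    n_samples: int, n_controls: int) -> dict:
    """Summarize liquid use and tap counts for one spotting run."""
    dispensed_nl = per_spot_nl * n_slides          # one spot per slide per load
    spots_per_slide = n_samples + n_controls       # assumes controls on every slide
    return {
        "dispensed_per_load_nl": dispensed_nl,
        "remaining_per_load_nl": loaded_nl - dispensed_nl,
        "spots_per_slide": spots_per_slide,
        "total_taps": spots_per_slide * n_slides,
    }


if __name__ == "__main__":
    # 1 µl (1000 nl) loaded, ~5 nl per spot, 40 slides, 864 samples, 8 controls.
    for key, value in spotting_budget(1000, 5, 40, 864, 8).items():
        print(key, value)
```

Under these assumptions each 1 µl load supplies about 200 nl across the 40 slides, leaving most of the loaded volume in the capillary, and the complete run amounts to 872 spots per slide.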

For the pooled chromosome experiment, the 16
chromosomes of Saccharomyces cerevisiae were separated
in a CHEF agarose gel apparatus (Biorad, Richmond, CA).
The six largest chromosomes were isolated in one gel
slice and the smallest 10 chromosomes in a second gel
slice. The DNA was recovered using a gel extraction
kit (Qiagen, Chatsworth, CA). The two chromosome pools
were randomly amplified in a manner similar to that
used for the target lambda clones. Following
amplification, 5 micrograms of each of the amplified
chromosome pools were separately random-primer labeled
using Klenow polymerase (Amersham, Arlington Heights,
IL) with a lissamine conjugated nucleotide analog
(Dupont NEN, Boston, MA) for the pool containing the
six largest chromosomes, and with a fluorescein
conjugated nucleotide analog (BMB) for the pool
containing the smallest ten chromosomes. The two pools
were mixed and concentrated using an ultrafiltration
device (Amicon, Danvers, MA).
Five micrograms of the hybridization probe
consisting of both chromosome pools in 7.5 µl of TE was
denatured in a boiling water bath and then snap cooled
on ice. 2.5 µl of concentrated hybridization solution
(5 x SSC and 0.1% SDS) was added and all 10 µl
transferred to the array surface, covered with a cover
slip, placed in a custom-built single-slide humidity
chamber and incubated at 60 °C for 12 hours. The slides
were then rinsed at room temperature in 0.1 x SSC and
0.1% SDS for 5 minutes, cover slipped and scanned.

A custom built laser fluorescent scanner was used
to detect the two-color hybridization signals from the
1.8 x 1.8 cm array at 20 micron resolution. The
scanned image was gridded and analyzed using custom
image analysis software. After correcting for optical
crosstalk between the fluorophores due to their
overlapping emission spectra, the red and green
hybridization values for each clone on the array were
correlated to the known physical map position of the
clone resulting in a computer-generated color karyotype
of the yeast genome.
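
The text does not specify how the optical crosstalk correction was performed. One common approach, sketched below purely as an assumption, treats each measured channel as the true signal plus a fixed fraction of the other fluorophore's signal and inverts the resulting pair of linear equations. The function name and the crosstalk fractions are hypothetical and would have to be measured with single-fluorophore controls.

```python
from typing import Tuple


def unmix_two_color(measured_red: float, measured_green: float,
                    green_into_red: float, red_into_green: float) -> Tuple[float, float]:
    """Recover true (red, green) signals under an assumed linear crosstalk model.

    Model:  measured_red   = red   + green_into_red * green
            measured_green = green + red_into_green * red
    """
    det = 1.0 - green_into_red * red_into_green
    if det == 0.0:
        raise ValueError("crosstalk fractions make the channels indistinguishable")
    red = (measured_red - green_into_red * measured_green) / det
    green = (measured_green - red_into_green * measured_red) / det
    return red, green


if __name__ == "__main__":
    # Hypothetical spot: true red 100, true green 50, with 10% of green seen in
    # the red channel and 20% of red seen in the green channel.
    print(unmix_two_color(105.0, 70.0, green_into_red=0.1, red_into_green=0.2))
```

Substituting the model equations into the two expressions shows that each reduces to the true signal multiplied by det, so dividing by det recovers the uncontaminated red and green values.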

Figure 6 shows the hybridization pattern of the
two chromosome pools. A red signal indicates that the
lambda clone on the array surface contains a cloned
genomic DNA segment from one of the largest six yeast
chromosomes. A green signal indicates that the lambda
clone insert comes from one of the smallest ten yeast
chromosomes. Orange signals indicate repetitive
sequences which cross hybridized to both chromosome
pools. Control spots on the array confirm that the
hybridization is specific and reproducible.




The physical map locations of the genomic DNA
fragments contained in each of the clones used as array
elements have been previously determined by Olson and
co-workers (Riles, et al.), allowing for the automatic
generation of the color karyotype shown in Figure 7.
The color of a chromosomal section on the karyotype
corresponds to the color of the array element
containing the clone from that section. The black
regions of the karyotype represent false negative dark
spots on the array (10%) or regions of the genome not
covered by the Olson clone library (90%). Note that
the largest six chromosomes are mainly red while the
smallest ten chromosomes are mainly green, matching the
original CHEF gel isolation of the hybridization probe.
Areas of the red chromosomes containing green spots and
vice-versa are probably due to spurious sample tracking
errors in the formation of the original library and in
the amplification and spotting procedures.
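
The automatic generation of a color karyotype from array data can be sketched as follows. This is only an illustration under stated assumptions: the clone names, coordinates, and data layout are hypothetical, and clones with no detected signal are simply drawn black, as are regions that no clone covers.

```python
def build_karyotype(clone_map, spot_colors):
    """Group colored segments by chromosome.

    clone_map:   {clone_name: (chromosome, start, end)}  (hypothetical layout)
    spot_colors: {clone_name: color of the corresponding array element}
    Returns {chromosome: [(start, end, color), ...]} sorted by position.
    """
    karyotype = {}
    for clone, (chrom, start, end) in clone_map.items():
        # A clone whose spot gave no signal is shown as a black segment.
        color = spot_colors.get(clone, "black")
        karyotype.setdefault(chrom, []).append((start, end, color))
    for segments in karyotype.values():
        segments.sort()
    return karyotype


# Hypothetical clones on a small (green pool) and a large (red pool) chromosome.
clone_map = {"c1": (1, 0, 10), "c2": (1, 10, 25), "c3": (4, 0, 12)}
spot_colors = {"c1": "green", "c3": "red"}
karyotype = build_karyotype(clone_map, spot_colors)
```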

The yeast genome arrays have also been probed with
individual clones or pools of clones that are
fluorescently labeled for physical mapping purposes.
The hybridization signals of these clones to the array
were translated into a position on the physical map of
yeast.


Example 2 

Total cDNA Hybridized to Micro Arrays of 
cDNA Clones with Two-Color 
Fluorescent Detection 

24 clones containing cDNA inserts from the plant
Arabidopsis were amplified using PCR. Salt was added
to the purified PCR products to a final concentration
of 3 x SSC. The cDNA clones were spotted on
poly-L-lysine coated microscope slides in a manner similar to
Example 1. Among the cDNA clones was a clone
representing a transcription factor, HAT4, which had
previously been used to create a transgenic line of the
plant Arabidopsis in which this gene is present at ten
times the level found in wild-type Arabidopsis (Schena,
et al., 1992).

Total poly-A mRNA from wild type Arabidopsis was
isolated using standard methods (Maniatis, et al.,
1989) and reverse transcribed into total cDNA, using a
fluorescein nucleotide analog to label the cDNA product
(green fluorescence). A similar procedure was
performed with the transgenic line of Arabidopsis where
the transcription factor HAT4 was inserted into the
genome using standard gene transfer protocols. cDNA
copies of mRNA from the transgenic plant were labeled
with a lissamine nucleotide analog (red fluorescence).
Two micrograms of the cDNA products from each type of
plant were pooled together and hybridized to the cDNA
clone array in a 10 microliter hybridization reaction
in a manner similar to Example 1. Rinsing and
detection of hybridization were also performed in a
manner similar to Example 1. Fig. 8 shows the resulting
hybridization pattern of the array.

Genes equally expressed in wild type and the
transgenic Arabidopsis appeared yellow due to equal
contributions of the green and red fluorescence to the
final signal. The dots are different intensities of
yellow, indicating various levels of gene expression.
The cDNA clone representing the transcription factor
HAT4, expressed in the transgenic line of Arabidopsis
but not detectably expressed in wild type Arabidopsis,
appears as a red dot (with the arrow pointing to it),
indicating the preferential expression of the
transcription factor in the red-labeled transgenic
Arabidopsis and the relative lack of expression of the
transcription factor in the green-labeled wild type
Arabidopsis.
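
The color interpretation described in this example (yellow for equal expression, red or green for preferential expression in one sample) can be sketched as a ratio test on the two channel intensities. The 2-fold threshold, the intensity floor, and the function name are assumptions chosen for illustration; the example itself does not state numerical cutoffs.

```python
def classify_spot(red, green, fold=2.0, floor=1e-9):
    """Label a spot by its red/green intensity ratio (illustrative thresholds)."""
    if red <= floor and green <= floor:
        return "dark"  # no detectable signal in either channel
    # The small floor avoids division by zero when one channel is empty.
    ratio = (red + floor) / (green + floor)
    if ratio >= fold:
        return "red"     # preferentially expressed in the red-labeled sample
    if ratio <= 1.0 / fold:
        return "green"   # preferentially expressed in the green-labeled sample
    return "yellow"      # roughly equal contribution from both samples


# Hypothetical intensities for three spots.
labels = [classify_spot(100, 100), classify_spot(1000, 50), classify_spot(50, 1000)]
```

A fixed fold threshold is the simplest choice; the overall brightness of a yellow spot, which the text relates to the level of expression, is ignored by this sketch.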

An advantage of the microarray hybridization
format for gene expression studies is the high partial
concentration of each cDNA species achievable in the 10
microliter hybridization reaction. This high partial
concentration allows for detection of rare transcripts
without the need for PCR amplification of the
hybridization probe, which may bias the true genetic
representation of each discrete cDNA species.

Gene expression studies such as these can be used
for genomics research to discover which genes are
expressed in which cell types, disease states,
development states or environmental conditions. Gene
expression studies can also be used for diagnosis of
disease by empirically correlating gene expression
patterns to disease states.

Example 3 

Multiplexed Colorimetric Hybridization on
a Gridded Solid Support

A sheet of plastic-backed nitrocellulose was
gridded with barrier elements made from silicone rubber
according to the description in Section IV-A. The
sheet was soaked in 10 x SSC and allowed to dry. As
shown in Fig. 12, 192 M13 clones, each with a different
yeast insert, were arrayed 400 microns apart in four
quadrants of the solid support using the automated
device described in Section III. The bottom left
quadrant served as a negative control for hybridization
while each of the other three quadrants was hybridized
simultaneously with a different oligonucleotide using
the open-face hybridization technology described in
Section IV-A. The first two and last four elements of
each array are positive controls for the colorimetric
detection step.

The oligonucleotides were labeled with fluorescein
which was detected using an anti-fluorescein antibody
conjugated to alkaline phosphatase that precipitated an
NBT/BCIP dye on the solid support (Amersham). Perfect
matches between the labeled oligos and the M13 clones
resulted in dark spots visible to the naked eye and
detected using an optical scanner (HP ScanJet II)
attached to a personal computer. The hybridization
patterns are different in every quadrant, indicating
that each oligo found several unique M13 clones from
among the 192 with a perfect sequence match. Note that
the open capillary printing tip leaves detectable
dimples on the nitrocellulose which can be used to
automatically align and analyze the images.

Although the invention has been described with
respect to specific embodiments and methods, it will be
clear that various changes and modifications may be made
without departing from the invention.




IT IS CLAIMED: 

1. A method of forming a microarray of analyte-
assay regions on a solid support, where each region in
the array has a known amount of a selected, analyte-
specific reagent, said method comprising,

(a) loading a solution of a selected analyte-
specific reagent in a reagent-dispensing device having
an elongate capillary channel (i) formed by spaced-
apart, coextensive elongate members, (ii) adapted to
hold a quantity of the reagent solution and (iii)
having a tip region at which aqueous solution in the
channel forms a meniscus,

(b) tapping the tip of the dispensing device
against a solid support at a defined position on the
surface, with an impulse effective to break the
meniscus in the capillary channel and deposit a
selected volume of solution on the surface, and

(c) repeating steps (a) and (b) until said array
is formed.

2. The method of claim 1, wherein said tapping is
carried out with an impulse effective to deposit a
selected volume in the volume range between 0.01 to 100
nl.

3. The method of claim 1, wherein said channel is 
formed by a pair of spaced-apart tapered elements. 

4. The method of claim 1, for forming a plurality
of such arrays, wherein step (b) is applied to a
selected position on each of a plurality of solid
supports at each repeat cycle proceeding step (c).




5. The method of claim 1, which further includes,
after performing steps (a) and (b) at least one time,
reloading the reagent-dispensing device with a new
reagent solution by the steps of (i) dipping the
capillary channel of the device in a wash solution,
(ii) removing wash solution drawn into the capillary
channel, and (iii) dipping the capillary channel into
the new reagent solution.

6. Automated apparatus for forming a microarray
of analyte-assay regions on a plurality of solid
supports, where each region in the array has a known
amount of a selected, analyte-specific reagent, said
apparatus comprising

(a) a holder for holding, at known positions, a
plurality of planar supports,

(b) a reagent dispensing device having an open
capillary channel (i) formed by spaced-apart,
coextensive elongate members (ii) adapted to hold a
quantity of the reagent solution and (iii) having a tip
region at which aqueous solution in the channel forms a
meniscus,

(c) positioning means for positioning the
dispensing device at a selected array position with
respect to a support in said holder,

(d) dispensing means for moving the device into
tapping engagement against a support with a selected
impulse, when the device is positioned at a defined
array position with respect to that support, with an
impulse effective to break the meniscus of liquid in
the capillary channel and deposit a selected volume of
solution on the surface, and

(e) control means for controlling said positioning
and dispensing means.




7. The apparatus of claim 6, wherein said
dispensing means is effective to move said dispensing
device against a support with an impulse effective to
deposit a selected volume in the volume range between
0.01 to 100 nl.

8. The apparatus of claim 6, wherein said channel 
is formed by a pair of spaced-apart tapered elements. 

9. The apparatus of claim 6, wherein the control
means operates to (i) place the dispensing device at a
loading station, (ii) move the capillary channel in the
device into a selected reagent at the loading station,
to load the dispensing device with the reagent, and
(iii) dispense the reagent at a defined array position
on each of the supports on said holder.

10. The apparatus of claim 6, wherein the control 
device further operates, at the end of a dispensing 
cycle, to wash the dispensing device by (i) placing the 
dispensing device at a washing station, (ii) moving the 
capillary channel in the device into a wash fluid, to 
load the dispensing device with the fluid, and (iii) 
remove the wash fluid prior to loading the dispensing 
device with a fresh selected reagent. 

11. The apparatus of claim 6, wherein said device
is one of a plurality of such devices which are carried
on the arm for dispensing different analyte assay
reagents at selected spaced array positions.

12. A substrate with a surface having a
microarray of at least 10^ distinct polynucleotide or
polypeptide biopolymers per 1 cm^2 surface area, each
distinct biopolymer sample (i) being disposed at a
separate, defined position in said array, (ii) having a
length of at least 50 subunits, and (iii) being present
in a defined amount between about 0.1 femtomole and 100
nanomoles.

13. The substrate of claim 12, wherein said
surface is a glass slide coated with polylysine, and said
biopolymers are polynucleotides.


14. The substrate of claim 12, wherein said
substrate has a water-impermeable backing, a water-
permeable film formed on the backing, and a grid formed
on the film, where said grid (i) is composed of
intersecting water-impervious grid elements extending
from said backing to positions raised above the surface
of said film, and (ii) partitions the film into a
plurality of water-impervious cells, where each cell
contains such a biopolymer array.


15. A substrate with a surface array of sample-
receiving cells, comprising

a water-impermeable backing,

a water-permeable film formed on the backing, and
a grid formed on the film, said grid being composed of
intersecting water-impervious grid elements extending
from said backing to positions raised above the surface
of said film.

16. The substrate of claim 15, wherein the cells
of the array each contain an array of biopolymers.

17. A substrate for use in detecting binding of
labeled biopolymers to one or more of a plurality
distinct polynucleotides, comprising

a non-porous, glass substrate,

a coating of a cationic polymer on said substrate,
and

an array of distinct polynucleotides to said
coating, where each biopolymer is disposed at a
separate, defined position in a surface array of
biopolymers.

18. A method of detecting differential expression
of each of a plurality of genes in a first cell type
with respect to expression of the same genes in a
second cell type, said method comprising

producing fluorescence-labeled cDNA's from mRNA's
isolated from the two cell types, where the cDNA's
from the first and second cells are labeled with first
and second different fluorescent reporters,

adding a mixture of the labeled cDNA's from the
two cell types to an array of polynucleotides
representing a plurality of known genes derived from
the two cell types, under conditions that result in
hybridization of the cDNA's to complementary-sequence
polynucleotides in the array; and

examining the array by fluorescence under
fluorescence excitation conditions in which (i)
polynucleotides in the array that are hybridized
predominantly to cDNA's derived from one of the first
and second cell types give a distinct first or second
fluorescence emission color, respectively, and (ii)
polynucleotides in the array that are hybridized to
substantially equal numbers of cDNA's derived from the
first and second cell types give a distinct combined
fluorescence emission color, respectively,

wherein the relative expression of known genes in
the two cell types can be determined by the observed
fluorescence emission color of each spot.




19. The method of claim 18, wherein the array of
polynucleotides is formed on a substrate with a surface
having an array of at least 10^ distinct polynucleotide
or polypeptide biopolymers in a surface area of less
than about 1 cm^2, each distinct biopolymer (i) being
disposed at a separate, defined position in said array,
(ii) having a length of at least 50 subunits, and (iii)
being present in a defined amount between about 0.1
femtomole and 100 nmoles.


20. The method of claim 19, wherein said surface
is a glass slide coated with polylysine, and said
biopolymers are polynucleotides non-covalently bound to
said polylysine.




[Drawing sheets 1/6 through 6/6. Recoverable figure labels: Fig. 2C (sheet 1/6); Fig. 3 (sheet 2/6); Fig. 7 (sheet 4/6); Fig. 9 and Fig. 10 (sheet 5/6); Fig. 11 and Fig. 12 (sheet 6/6).]



INTERNATIONAL SEARCH REPORT 



International application No.
PCT/US95/07659

A. CLASSIFICATION OF SUBJECT MATTER

IPC(6) : G01N 33/543, 33/68
US CL : 435/6; 436/518
According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED

Minimum documentation searched (classification system followed by classification symbols)
U.S. : 422/57; 435/4, 6, 973; 436/518, 524, 527, 531, 805, 809

Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched

Electronic data base consulted during the international search (name of data base and, where practicable, search terms used)



C. DOCUMENTS CONSIDERED TO BE RELEVANT 



Category*   Citation of document, with indication, where appropriate, of the relevant passages   Relevant to claim No.

A,P         US, A, 5,338,688 (DEEG ET AL) 16 August 1994, see entire document.                   1-17
            US, A, 5,204,268 (MATSUMOTO) 20 April 1993, see entire document.                     6-11
            US, A, 4,071,315 (CHATEAU) 31 January 1978, see entire document.                     12-17
            US, A, 5,100,777 (CHANG) 31 March 1992, see entire document.                         12-17
            US, A, 5,200,312 (OPRANDY) 06 April 1993, see entire document.                       12-17



[ ] Further documents are listed in the continuation of Box C. See patent family annex.

Special categories of cited documents:

"A" document defining the general state of the art which is not considered to be of particular relevance
"E" earlier document published on or after the international filing date
"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or other special reason (as specified)
"O" document referring to an oral disclosure, use, exhibition or other means
"P" document published prior to the international filing date but later than the priority date claimed
"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention
"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step when the document is taken alone
"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art

Date of the actual completion of the international search

15 SEPTEMBER 1995

Name and mailing address of the ISA/US
Commissioner of Patents and Trademarks
Box PCT
Washington, D.C. 20231
Facsimile No. (703) 305-3230

Form PCT/ISA/210 (second sheet)(July 1992)



Reference 6 of 20
with Response dated 05/04/04
In USSN: 09/857,826

Proc. Natl. Acad. Sci. USA 

Vol. 94, pp. 2150-2155, March 1997 

Biochemistry 



Discovery and analysis of inflammatory disease-related genes 
using cDNA microarrays 

(inflammation/human genome analysis/gene discovery)

Renu A. Heller*†, Mark Schena*, Andrew Chai*, Dari Shalon*, Tod Bedilion*, James Gilmore*,
David E. Woolley§, and Ronald W. Davis*

*Department of Biochemistry, Beckman Center, Stanford University Medical Center, Stanford, CA 94305; ‡Synteni, Palo Alto, CA 94306; and §Department of
Medicine, Manchester Royal Infirmary, Manchester, United Kingdom

Contributed by Ronald W. Davis, December 27, 1996

ABSTRACT cDNA microarray technology is used to profile
complex diseases and discover novel disease-related genes. In
inflammatory disease such as rheumatoid arthritis, expression
patterns of diverse cell types contribute to the pathology. We
have monitored gene expression in this disease state with a
microarray of selected human genes of probable significance in
inflammation as well as with genes expressed in peripheral
human blood cells. Messenger RNA from cultured macrophages,
chondrocyte cell lines, primary chondrocytes, and synoviocytes
provided expression profiles for the selected cytokines,
chemokines, DNA binding proteins, and matrix-degrading
metalloproteinases. Comparisons between tissue samples of
rheumatoid arthritis and inflammatory bowel disease verified the
involvement of many genes and revealed novel participation of the
cytokine interleukin 3, chemokine Groα and the
metalloproteinase matrix metallo-elastase in both diseases. From the
peripheral blood library, tissue inhibitor of metalloproteinase 1,
ferritin light chain, and manganese superoxide dismutase genes
were identified as expressed differentially in rheumatoid
arthritis compared with inflammatory bowel disease. These results
successfully demonstrate the use of the cDNA microarray system
as a general approach for dissecting human diseases.



The recently described cDNA microarray or DNA-chip tech- 
nology allows expression monitoring of hundreds and thou- 
sands of genes simultaneously and provides a format for 
identifying genes as well as changes in their activity (1, 2). 
Using this technology, two-color fluorescence patterns of 
differential gene expression in the root versus the shoot tissue 
of Arabidopsis were obtained in a specific array of 48 genes (1). 
In another study using a 1000 gene array from a human 
peripheral blood library, novel genes expressed by T cells were 
identified upon heat shock and protein kinase C activation (3). 

The technology uses cDNA sequences or cDNA inserts of a
library for PCR amplification that are arrayed on a glass slide with
high speed robotics at a density of 1000 cDNA sequences per cm^2.
These microarrays serve as gene targets for hybridization to
cDNA probes prepared from RNA samples of cells or tissues. A
two-color fluorescence labeling technique is used in the
preparation of the cDNA probes such that a simultaneous hybridization
but separate detection of signals provides the comparative
analysis and the relative abundance of specific genes expressed (1, 2).
Microarrays can be constructed from specific cDNA clones of
interest, a cDNA library, or a select number of open reading
frames from a genome sequencing database to allow a large-scale
functional analysis of expressed sequences.



The publication costs of this article were defrayed in part by page charge 
payment. This article must therefore be hereby marked "advertisement" in 
accordance with 18 U.S.C. §1734 solely to indicate this fact. 

Copyright © 1997 by The National Academy of Sciences of the USA 

0027-8424/97/942150-6$2.00/0

PNAS is available online at http://www.pnas.org. 



Because of the wide spectrum of genes and endogenous 
mediators involved, the microarray technology is well suited 
for analyzing chronic diseases. In rheumatoid arthritis (RA), 
inflammation of the joint is caused by the gene products of 
many different cell types present in the synovium and cartilage 
tissues plus those infiltrating from the circulating blood. The 
autoimmune and inflammatory nature of the disease is a 
cumulative result of genetic susceptibility factors and multiple 
responses, paracrine and autocrine in nature, from macro- 
phages, T cells, plasma cells, neutrophils, synovial fibroblasts, 
chondrocytes, etc. Growth factors, inflammatory cytokines 
(4), and the chemokines (5) are the important mediators of this 
inflammatory process. The ensuing destruction of the cartilage 
and bone by the invading synovial tissue includes the actions 
of prostaglandins and leukotrienes (6), and the matrix-degrading
metalloproteinases (MMPs). The MMPs are an important
class of Zn-dependent metallo-endoproteinases that can
collectively degrade the proteoglycan and collagen components of
the connective tissue matrix (7).

This paper presents a study in which the involvement of 
select classes of molecules in RA was examined. Also inves- 
tigated were 1000 human genes randomly selected from a 
peripheral human blood cell library. Their differential and 
quantitative expression analysis in cells of the joint tissue, in 
diseased RA tissue and in inflammatory bowel disease (IBD) 
tissues was conducted to demonstrate the utility of the mi- 
croarray method to analyze complex diseases by their pattern 
of gene expression. Such a survey provides insight not only into 
the underlying cause of the pathology, but also provides the 
opportunity to selectively target genes for disease intervention 
by appropriate drug development and gene therapies. 

METHODS 

Microarray Design, Development, and Preparation. Two
approaches for the fabrication of cDNA microarrays were used in
this study. In the first approach, known human genes of probable 
significance in RA were identified. Regions of the clones, pref- 
erably 1 kb in length, were selected by their proximity to the 3' end 
of the cDNA and for areas of least identity to related and 
repetitive sequences. Primers were synthesized to amplify the 
target regions by standard PCR protocols (3). Products were 



Abbreviations: RA, rheumatoid arthritis; MMP, matrix-degrading
metalloproteinase; IBD, inflammatory bowel disease; LPS,
lipopolysaccharide; PMA, phorbol 12-myristate 13-acetate; TNF-α, tumor
necrosis factor α; IL, interleukin; TGF-β, transforming growth factor
β; GCSF, granulocyte colony-stimulating factor; MIP, macrophage
inflammatory protein; MIF, migration inhibitory factor; HME, human
matrix metallo-elastase; RANTES, regulated upon activation, normal
T cell expressed and secreted; Gel, gelatinase; VCAM, vascular cell
adhesion molecule; ICE, IL-1 converting enzyme; PUMP, putative
metalloproteinase; MnSOD, manganese superoxide dismutase; TIMP,
tissue inhibitor of metalloproteinase; MCP, macrophage chemotactic
protein.

†To whom reprint requests should be sent at the present address:
Roche Bioscience, S3-1, 3401 Hillview Avenue, Palo Alto, CA 94304.






verified by gel electrophoresis and purified with Qiaquick 96-well
purification kit (Qiagen, Chatsworth, CA), lyophilized (Savant),
and resuspended in 5 μl of 3× standard saline citrate (SSC) buffer
for arraying. In the second approach, the microarray containing
the 1056 human genes from the peripheral blood lymphocyte
library was prepared as described (3).

Tissue Specimens. Rheumatoid synovial tissue was obtained 
from patients with late stage classic RA undergoing remedial 
synovectomy or arthroplasty of the knee. Synovial tissue was 
separated from any associated connective tissue or fat. One 
gram of each synovial specimen was subjected to RNA extrac- 
tion within 40 min of surgical excision, or explants were 
cultured in serum-free medium to examine any changes under 
in vitro conditions. For IBD, specimens of macroscopically 
inflamed lower intestinal mucosa were obtained from patients 
with Crohn disease undergoing remedial surgery. The hyper- 
trophied mucosal tissue was separated from underlying con- 
nective tissue and extracted for RNA. 

Cultured Cells. The Mono Mac-6 (MM6) monocytic cells
(8) were grown in RPMI medium. Human chondrosarcoma
SW1353 cells, primary human chondrocytes, and synoviocytes
(9, 10) were cultured in DMEM; all culture media were
supplemented with 10% fetal bovine serum, 100 μg/ml
streptomycin, and 500 units/ml penicillin. Treatment of cells with
lipopolysaccharide (LPS) endotoxin at 30 ng/ml, phorbol
12-myristate 13-acetate (PMA) at 50 ng/ml, tumor necrosis
factor α (TNF-α) at 50 ng/ml, interleukin (IL)-1β at 30 ng/ml,
or transforming growth factor-β (TGF-β) at 100 ng/ml is
described in the figure legends.



Fluorescent Probe, Hybridization, and Scanning. Isolation of
mRNA, probe preparation, and quantitation with Arabidopsis
control mRNAs were essentially as described (3) except for the
following minor modification. Following the reverse transcriptase
step, the appropriate Cy3- and Cy5-labeled samples were pooled;
mRNA was degraded by heating the sample to 65°C for 10 min with
the addition of 5 μl of 0.5M NaOH plus 0.5 ml of 10 mM EDTA.
The pooled cDNA was purified from unincorporated nucleotides
by gel filtration in Centri-spin columns (Princeton Separations,
Adelphia, NJ). Samples were lyophilized and dissolved in 6 μl of
hybridization buffer (5x SSC plus 0.2% SDS). Hybridizations,
washes, scanning, quantitation procedures, and pseudocolor
representations of fluorescent images have been described (3). Scans
for the two fluorescent probes were normalized either to the
fluorescence intensity of Arabidopsis mRNAs spiked into the
labeling reactions (see Figs. 2-4) or to the signal intensity of
β-actin and glyceraldehyde-3-phosphate dehydrogenase
(GAPDH; see Fig. 5).
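
The normalization step described above can be sketched as follows. This is a minimal illustration, assuming that the two scans are brought onto a common scale with a single factor derived from the mean intensity of the control elements (spiked Arabidopsis mRNAs or housekeeping genes). The function name, the dictionary layout, and the intensity values are hypothetical; the quantitation procedure actually used is the one described in ref. 3.

```python
from statistics import mean


def normalized_ratios(cy3, cy5, control_names):
    """Return Cy5/Cy3 ratios after scaling Cy5 so the controls match.

    cy3, cy5:      {element_name: raw intensity} for each scan (hypothetical layout)
    control_names: elements expected to give equal signal in both channels
    """
    cy3_controls = [cy3[name] for name in control_names]
    cy5_controls = [cy5[name] for name in control_names]
    # One global scale factor that equalizes the mean control signal.
    scale = mean(cy3_controls) / mean(cy5_controls)
    return {name: (cy5[name] * scale) / cy3[name] for name in cy3}


# Hypothetical raw intensities; HAT4 stands in for a spiked control element.
cy3 = {"HAT4": 200.0, "IL1B": 100.0, "TNFA": 400.0}
cy5 = {"HAT4": 100.0, "IL1B": 150.0, "TNFA": 100.0}
ratios = normalized_ratios(cy3, cy5, control_names=["HAT4"])
```

With these made-up numbers the control ratio becomes 1.0 by construction, and the remaining ratios are read relative to it.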

RESULTS 

Ninety-Six-Gene Microarray Design. The actions of cytokines,
growth factors, chemokines, transcription factors, MMPs,
prostaglandins, and leukotrienes are well recognized in inflammatory
disease, particularly RA (11-14). Fig. 1 displays the selected genes
for this study and also includes control cDNAs of housekeeping
genes such as β-actin and GAPDH and genes from Arabidopsis
for signal normalization and quantitation (row A, columns 1-12).

Defining Microarray Assay Conditions. Different lengths and 
concentrations of target DNA were tested by arraying PCR- 



[Fig. 1 image: 96-element grid showing each target element name with its corresponding gene. Legend categories: A. thaliana controls; human controls; cytokines and related genes; transcription factors and related genes; MMPs and related genes; chemokines; growth factors and related genes; other genes.]


Fig. 1. Ninety-six-element microarray design. The target element name and the corresponding gene are shown in the layout. Some genes have 
more than one target element to guarantee specificity of signal. For TNF the targets represent decreasing lengths of 1, 0.8, 0.6, 0.4, and 0.2 kb from 
left to right. 






amplified products ranging from 0.2 to 1.2 kb at concentrations
of 1 μg/μl or less. No significant difference in the signal levels was
observed within this range of target size and only with 0.2-kb
length was a signal reduced upon an 8-fold dilution of the 1 μg/μl
sample (data not shown). In this study the average length of the
targets was 1 kb, with a few exceptions in the range of ≈300 bp,
arrayed at a concentration of 1 μg/μl. Normally one PCR
provided sufficient material to fabricate up to 1000 microarray targets.

In considering positional effects in the development of the 
targets for the microarrays, selection was biased toward the 3' 
proximal regions, because the signal was reduced if the target 
fragment was biased toward the 5' end (data not shown). This 
result was anticipated since the hybridizing probe is prepared by 
reverse transcription with oligo(dT)-primed mRNA and is richer 
in 3' proximal sequences. Cross-hybridizations of probes to 
targets of a gene family were analyzed with the matrix metal- 



loproteinases as the example because they can show regions of 
sequence identities of greater than 70%. With collagenase-1 
(Col-1) and collagenase-2 (Col-2) genes as targets with up to 70% 
sequence identity, and stromelysin-1 (Strom-1) and stromelysin-2 
(Strom-2) genes with different degrees of identity, our results 
showed that a short region of overlap, even with 70-90% se- 
quence identity, produced a low level of cross-hybridization. 
However, shorter regions of identity spread over the length of the 
target resulted in cross-hybridization (data not shown). For 
closely related genes, targets were designed by avoiding long 
stretches of homology. For members of a gene family two or more 
target regions were included to discriminate between specificity 
of signal versus cross-hybridization. 
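
The design rule above (avoid long stretches of homology between related targets) can be illustrated with a minimal sketch. It is not the authors' procedure: the function name, the window size, and the threshold are hypothetical, and the two sequences are assumed to be already aligned to equal length.

```python
def identical_windows(seq_a, seq_b, window=50, threshold=0.7):
    """Return (start, identity) for every window of two equal-length,
    pre-aligned sequences whose fractional identity meets the threshold."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # Count matching positions; alignment gaps ("-") never count as matches.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        identity = matches / window
        if identity >= threshold:
            hits.append((start, identity))
    return hits


# Toy sequences (hypothetical), with mismatches at positions 4 and 9.
hits = identical_windows("ACGTACGTAC", "ACGTTCGTAA", window=4, threshold=1.0)
```

A candidate target region would be rejected, under these assumptions, when many consecutive windows are flagged against another member of the same gene family.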

Monitoring Differential Expression in Cultured Cell lines. In 
RA tissue, the monocyte/macrophage population plays a prom- 
inent role in phagocytic and immunomodulatory activities. Typ- 



[Fig. 2 panels: (A) array scans, uninduced and at later time points; (B) I. Cytokines; II. Chemokines; III. Transcription factors; x axis: Time (hours).]


Fig. 2. Time course for LPS/PMA-induced MM6 cells. Array elements are described in Fig. 1. (A) Pseudocolor representations of fluorescent
scans correspond to gene expression levels at each time point. The array is made up of 8 Arabidopsis control targets and 86 human cDNA targets,
the majority of which are genes with known or suspected involvement in inflammation. The color bars provide a comparative calibration scale
between arrays and are derived from the Arabidopsis mRNA samples that are introduced in equal amounts during probe preparation. Fluorescent
probes were made by labeling mRNA from untreated MM6 cells or LPS and PMA treated cells. mRNA was isolated at indicated times after
induction. (B I-III) The two-color samples were cohybridized, and microarray scans provided the data for the levels of select transcripts at different
time points relative to abundance at time zero. The analysis was performed using normalized data collected from 8-bit images.






ically these cells, when triggered by an immunogen, produce the 
proinflammatory cytokines TNF and IL-1. We have used the
monocyte cell line MM6 and monitored changes in gene expression
upon activation with LPS endotoxin, a component of Gram-negative
bacterial membranes, and PMA, which augments the
action of LPS on TNF production (15). RNA was isolated at
different times after induction and used for cDNA probe preparation.
From this time course it was clear that TNF expression
was induced within 15 min of treatment, reached maximum levels
in 1 hr, remained high until 4 hr, and subsequently declined (Fig.
2A). Many other cytokine genes were also transiently activated,
such as IL-1α and -β, IL-6, and granulocyte colony-stimulating
factor (GCSF). Prominent chemokines activated were IL-8, macrophage
inflammatory protein (MIP)-1β, more so than MIP-1α,
and Groα or melanoma growth stimulatory factor. Migration
inhibitory factor (MIF) expressed in the uninduced state declined
in LPS-activated cells. Of the immediate early genes, the noticeable
ones were c-fos, fra-1, c-jun, NF-κB p50, and IκB, with c-rel
expression observed even in the uninduced state (Fig. 2B). These
expression patterns are consistent with reported patterns of
activation of certain LPS- and PMA-induced genes (12).
Demonstrated here is the unique ability of this system to allow parallel
visualization of a large number of gene activities over a period of
time.

The SW1353 cell line is derived from malignant tumors of the
cartilage and behaves much like chondrocytes upon stimulation
with TNF and IL-1 in the expression of MMPs (9). In
addition to confirming our earlier observations with Northern
blots on Strom-1, Col-1, and Col-3 expression (9), gelatinase
(Gel) A, putative metalloproteinase (PUMP)-1, membrane-



[Fig. 3 panels: (A) array scans for uninduced cells and at 2, 8, and 18 hours; (B) II. Chemokines; III. Transcription factors; IV. Matrix metalloproteinases; x axis: Time (hours).]



Fig. 3. Time course for IL-1β and TNF-induced SW1353 cells
using the inflammation array (Fig. 1). (A) Pseudocolor representations
of fluorescent scans correspond to gene expression levels at each time
point. (B I-IV) Relative levels of selected genes at different time points
compared with time zero.



type matrix metalloproteinase, tissue inhibitors of matrix
metalloproteinases or tissue inhibitor of metalloproteinase 1
(TIMP-1), -2, and -3 were also expressed by these cells together
with the human matrix metallo-elastase (HME; Fig. 3A). HME
induction was estimated to be ≈50-fold and was greater than
any of the other MMPs examined (Fig. 3B). This result was
unexpected because HME is reportedly expressed only by
alveolar macrophage and placental cells (16). Expression of
the cytokines and chemokines, IL-6, IL-8, MIF, and MIP-1β
was also noted. A variety of other genes, including certain
transcription factors, were also up-regulated (Fig. 3), but the
overall time-dependent expression of genes in the SW1353
cells was qualitatively distinct from the MM6 cells.

Quantitation of differential gene expression (Figs. 2B and 
3B) was achieved with the simultaneous hybridization of 
Cy3-labeled cDNA from untreated cells and Cy5-labeled 
cDNA from treated samples. The estimated increases in 
expression from these microarrays for a select number of genes 
including IL-1β, IL-8, MIP-1β, TNF, HME, Col-1, Col-3,
Strom-1, and Strom-2 were compared with data collected from
dot blot analysis. Results (not shown) were in close agreement 
and confirmed our earlier observations on the use of the 
microarray method for the quantitation of gene expression (3). 
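
As an illustration of the two-color quantitation described above, the following is a minimal sketch, not the authors' analysis software. It assumes background-subtracted, strictly positive intensities for each array element, and it assumes that the spiked Arabidopsis control elements should give a treated/untreated ratio of 1. The function name and all example intensities are hypothetical.

```python
from statistics import median


def fold_changes(cy3, cy5, control_ids):
    """Return treated/untreated (Cy5/Cy3) ratios per element, scaled so that
    the median ratio of the spiked control elements equals 1.0."""
    control_ratios = [cy5[c] / cy3[c] for c in control_ids]
    scale = median(control_ratios)
    return {
        name: (cy5[name] / cy3[name]) / scale
        for name in cy3
        if name not in control_ids
    }


# Hypothetical intensities in arbitrary units (not data from the study).
untreated = {"ctrl1": 100.0, "ctrl2": 200.0, "TNF": 10.0, "MIF": 80.0}
treated = {"ctrl1": 50.0, "ctrl2": 100.0, "TNF": 100.0, "MIF": 20.0}
ratios = fold_changes(untreated, treated, {"ctrl1", "ctrl2"})
```

In this toy case the controls read half as bright in the Cy5 channel, so every raw ratio is doubled before interpretation; a ratio above 1 then indicates induction and a ratio below 1 indicates a decline relative to the untreated sample.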

Expression Profiles in Primary Chondrocytes and Synovio- 
cytes of Human RA Tissue. Given the sensitivity and the 
specificity of this method, expression profiles of primary 
synoviocytes and chondrocytes from diseased tissue were 
examined. Without prior exposure to inducing agents, low level 
expression of c-jun, GCSF, IL-3, TNF-β, MIF, and RANTES
(regulated upon activation, normal T cell expressed and secreted)
was seen as well as expression of MMPs, GelA,
Strom-1, Col-1, and the three TIMPs. In this case, Col-2 
hybridization was considered to be nonspecific because the 
second Col-2 target taken from the 3' end of the gene gave no 

[Fig. 4 panels: A. Human synovial fibroblasts; B. Human articular chondrocytes; rows: uninduced, PMA/IL-1β.]





Fig. 4. Expression profiles for early passage primary synoviocytes and
chondrocytes isolated from RA tissue, cultured in the presence of 10%
fetal calf serum and activated with PMA and IL-1β, or TNF and IL-1β,
or TGF-β for 18 hr. The color bars provide a comparative calibration scale
between arrays and are derived from the Arabidopsis mRNA samples that
are introduced in equal amounts during probe preparation.




signal. Treatment with PMA and IL-1, more so than with TNF and
IL-1, produced a dramatic up-regulation in expression of
several genes in both of these primary cell types. These genes
are as follows: the cytokine IL-6; the chemokines IL-8 and
Gro-1α; the MMPs Strom-1, Col-1, Col-3, and HME; and
the adhesion molecule, vascular cell adhesion molecule 1
(VCAM-1). The surprise again is HME expression in these
primary cells, for reasons discussed above. From these results,
the expression profiles of synoviocytes and the chondrocytes
appear very similar; the differences are more quantitative than
qualitative. Treatment of the primary chondrocytes with the
anabolic growth factor TGF-β produced an interesting profile:
a remarkable down-regulation of genes expressed
in both the untreated and induced state (Fig. 4).

Given the demonstrated effectiveness of this technology, a 
comparative analysis of two different inflammatory disease 
states was conducted with probes made from RA tissue and 
IBD samples. RA samples were from late stage rheumatoid 
synovial tissue, and IBD specimens were obtained from in- 
flamed lower intestinal mucosa of patients with Crohn disease. 
With both the 96-element known gene microarray and the 
1000-gene microarray of cDNAs selected from a peripheral 
human blood cell library (3), distinct differences in gene 
expression patterns were evident. On the 96-gene array, RA 
tissue samples from different affected individuals gave similar 
profiles (data not shown) as did different samples from the 
same individual (Fig. 5). These patterns were notably similar 
to those observed with primary synoviocytes and chondrocytes 
(Fig. 4). Included in the list of prominently up-regulated genes 
are IL-6, the MMPs Strom-1, Col-1, GelA, HME, and in 



[Fig. 5 panels: A. Rheumatoid arthritis (RA 21.5A, RA 21.5B, RA 21.5C); B. Inflammatory bowel disease (IBD-A, IBD-CI, IBD-CII); calibration scale from 0 to 100.]



Fig. 5. Expression profiles of RA tissue (A) and IBD tissue (B).
mRNA from RA tissue samples obtained from the same individual was
isolated directly after excision (RA 21.5A) or maintained in culture
without serum for 2 hr (RA 21.5B) or for 6 hr (RA 21.5C). Profiles
from tissue samples of two other individuals (data not shown) were
remarkably similar to the ones shown here. IBD-A and IBD-CI are
from mRNA samples prepared directly after surgery from two separate
individuals. For the IBD-CII probe, the tissue sample was cultured
in medium without serum for 2 hr before mRNA preparation.




certain samples PUMP, TIMPs, particularly TIMP-1 and
TIMP-3, and the adhesion molecule VCAM. Discernible levels
of macrophage chemotactic protein 1 (MCP-1), MIF, and
RANTES were also noted. IBD samples were, in comparison,
rather subdued, although IL-1 converting enzyme (ICE),
TIMP-1, and MIF were notable in all three IBD
samples examined here. In IBD-A, one of the three individual
samples, ICE, VCAM, Groα, and MMP expression was more
pronounced than in the others.

We also made use of a peripheral blood cDNA library (3) 
to identify genes expressed by lymphocytes infiltrating the 
inflamed tissues from the circulating blood. With the 1046- 
element array of randomly selected cDNAs from this library, 
probes made from RA and IBD samples showed hybridizations
to a large number of genes. Of these, many were common 
between the two disease tissues while others were differentially 
expressed (data not shown). A complete survey of these genes 
was beyond the scope of this study, but for this report we 
picked three genes that were up-regulated in the RA tissue 
relative to IBD. These cDNAs were sequenced and identified 
by comparison to the GenBank database. They are TIMP-1, 
apoferritin light chain, and manganese superoxide dismutase 
(MnSOD). Differential expression of MnSOD was observed only
in samples of RA tissue explants maintained in growth
medium without serum for between 2 and 16 hr. These
results also indicate that the expression profile of genes can be 
altered when explants are transferred to culture conditions. 

DISCUSSION 

The speed, ease, and feasibility of simultaneously monitoring 
differential expression of hundreds of genes with the cDNA 
microarray based system (1-3) is demonstrated here in the 
analysis of a complex disease such as RA. Many different cell
types in the RA tissue (macrophages, lymphocytes, plasma cells,
neutrophils, synoviocytes, chondrocytes, etc.) are known to contribute
to the development of the disease with the expression of
gene products known to be proinflammatory. They include the
cytokines, chemokines, growth factors, MMPs, eicosanoids, and 
others (7, 11-14), and the design of the 96-element known gene 
microarray was based on this knowledge and depended on the 
availability of the genes. The technology was validated by con- 
firming earlier observations on the expression of TNF by the 
monocyte cell line MM6, and of Col-1 and Col-3 expression in the 
chondrosarcoma cells and articular chondrocytes (9, 12). In our 
time-dependent survey the chronological order of gene activities 
in and between gene families was compared and the results have 
provided unprecedented profiles of the cytokines (TNF, IL-1, 
IL-6, GCSF, and MIF), chemokines (MIP-1α, MIP-1β, IL-8, and
Gro-1), certain transcription factors, and the matrix metal- 
loproteinases (GelA, Strom-1, Col-1, Col-3, HME) in the mac- 
rophage cell line MM6 and in the SW1353 chondrosarcoma cells. 

Earlier reports of cytokine production in the diseased state had 
established a model in which TNF is a major participant in RA. 
Its expression reportedly preceded that of the other cytokines and 
effector molecules (4). Our results strongly support this model,
as demonstrated in the time course of the MM6 cells where TNF
induction preceded that of IL-1α and IL-1β followed by IL-6 and
GCSF. These expression profiles demonstrate the utility of the
microarrays in determining the hierarchy of signaling events.

In the SW1353 chondrosarcoma cells, all the known MMPs and 
TIMPs were examined simultaneously. HME expression was 
discovered, which previously had been observed in only the 
stromal cells and alveolar macrophages of smoker's lungs and in 
placental tissue. Its presence in cells of the RA tissue is mean- 
ingful because its activity can cause significant destruction of 
elastin and basement membrane components (16, 17). Expression 
profiles of synovial fibroblasts and articular chondrocytes were 
remarkably similar and not too different from the SW1353 cells, 
indicating that the fibroblast and the chondrocyte can play equally 
aggressive roles in joint erosion. Prominent genes expressed were 






the MMPs, but chemokines and cytokines were also produced by 
these cells. The effect of the anabolic growth factor TGF-β was
profoundly evident in demonstrating the down-regulation of these
catabolic activities. 

RA tissue samples undeniably reflected profiles similar to 
the cell types examined. Active genes observed were IL-3, IL-6, 
ICE, the MMPs including HME and TIMPs, chemokines IL-8, 
Groα, MIP, MIF, and RANTES, and the adhesion molecule
VCAM. Of the growth factors, fibroblast growth factor β was
observed most frequently. In comparison, the expression 
patterns in the other inflammatory state (i.e., IBD) were not 
as marked as in the RA samples, at least as obtained from the 
tissue samples selected for this study. 

As an alternative approach, the 1046 cDNA microarray of 
randomly selected genes from a lymphocyte library was used to 
identify genes expressed in RA tissue (3). Many genes on this 
array hybridized with probes made from both RA and IBD tissue 
samples. The results are not surprising because inflammatory 
tissue is abundantly supplied with cell types infiltrating from the 
circulating blood, made apparent also by the high levels of 
chemokine expression in RA tissue. Because of the magnitude of
the effort required to identify all the hybridized genes, we have for
this report chosen to describe only three differentially expressed
genes mainly to verify this method of analysis. 

Of the large number of genes observed here, a fair number
were already known as active participants in inflammatory disease.
These are TNF, IL-1, IL-6, IL-8, GCSF, RANTES, and
VCAM. The novel participants not previously reported are
HME, IL-3, ICE, and Groα. With our discovery of HME
expression in RA, this gene becomes a target for drug intervention.
ICE is a cysteine protease well known for its IL-1β processing
activity (18), and recognized for its role in apoptotic cell death
(19). Its expression in RA tissue is intriguing. IL-3 is recognized
for its growth-promoting activity in hematopoietic cell lineages and is
a product of activated T cells (20); its expression in synoviocytes
and chondrocytes of RA tissue is a novel observation.

Like IL-8, Groα is a C-X-C subgroup chemokine and is a
potent neutrophil and basophil chemoattractant. It down-regulates
the expression of types I and III interstitial collagens
(21, 22) and is seen here produced by the MM6 cells, in primary
synoviocytes, and in RA tissue. With the presence of the C-C
chemokines RANTES, MCP, and MIP-1β (23), migration and
infiltration of monocytes, particularly T cells, into the tissue is
also enhanced (5); these chemokines aid in the trafficking and recruitment of
leukocytes into the RA tissue. Their activation, phagocytosis,
degranulation, and respiratory bursts could be responsible for 
the induction of MnSOD in RA. MnSOD is also induced by 
TNF and IL-1 and serves a protective function against oxida- 
tive damage. The induction of the ferritin light chain encoding 
gene in this tissue may be for reasons similar to those for 
MnSOD. Ferritin is the major intracellular iron storage protein 
and it is responsive to intracellular oxidative stress and reactive 
oxygen intermediates generated during inflammation (24, 25). 
The active expression of TIMP-1 in RA tissue, as detected by 
the 1000-element array, is no surprise because our results have 
repeatedly shown TIMP-1 to be expressed in the constitutive 
and induced states of RA cells and tissues. 

The suitability of the cDNA microarray technology for
profiling diseases and for identifying disease-related genes is
well documented here. This technology could provide new
targets for drug development and disease therapies, and in
doing so allow for improved treatment of chronic diseases that
are challenging because of their complexity.

We would like to thank the following individuals for their help in
obtaining reagents or providing cDNA clones to use as templates in
target preparation: N. Arai, P. Cannon, D. R. Cohen, T. Curran, V.
Dixit, D. A. Geller, G. I. Goldberg, M. Karin, M. Lotz, L. Matrisian,
G. Nolan, C. Lopez-Otin, T. Schall, S. Shapiro, I. Verma, and H. Van
Wart. Support for R.W.D., M.S., and R.A.H. was provided by the
National Institutes of Health (Grants R37HG00198 and HG00205).

1. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995)
Science 270, 467-470.

2. Shalon, D., Smith, S. & Brown, P. O. (1996) Genome Res. 6,
639-645.

3. Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P. O. &
Davis, R. W. (1996) Proc. Natl. Acad. Sci. USA 93, 10614-10619.

4. Feldmann, M., Brennan, F. M. & Maini, R. N. (1996) Rheumatoid
Arthritis Cell 85, 307-310.

5. Schall, T. J. (1994) in The Cytokine Handbook, ed. Thomson,
A. W. (Academic, New York), 2nd Ed., pp. 410-460.

6. Lotz, M. F., Blanco, J., Von Kempis, J., Dudler, J., Maier, R.,
Villiger, P. M. & Geng, Y. (1995) J. Rheumatol. 22, Supplement
43, 104-108.

7. Birkedal-Hansen, H., Moore, W. G. I., Bodden, M. K., Windsor,
L. J., Birkedal-Hansen, B., DeCarlo, A. & Engler, J. A. (1993)
Crit. Rev. Oral Biol. Med. 4, 197-250.

8. Zeigler-Heitbrock, H. W. L., Thiel, E., Futterer, A., Volker, H.,
Wirtz, A. & Reithmuller, G. (1988) Int. J. Cancer 41, 456-461.

9. Borden, P., Solymar, D., Sucharczuk, A., Lindman, B., Cannon,
P. & Heller, R. A. (1996) J. Biol. Chem. 271, 23577-23581.

10. Gadher, S. J. & Woolley, D. E. (1987) Rheumatol. Int. 7, 13-22.

11. Harris, E. D., Jr. (1990) New Engl. J. Med. 322, 1277-1289.

12. Firestein, G. S. (1996) in Textbook of Rheumatology, eds. Kelly,
W. N., Harris, E. D., Ruddy, S. & Sledge, C. B. (Saunders,
Philadelphia), 5th Ed., pp. 5001-5047.

13. Alvaro-Garcia, J. M., Zvaifler, Nathan J., Brown, C. B., Kaushansky,
K. & Firestein, Gary S. (1991) J. Immunol. 146, 3365-3371.

14. Firestein, G. S., Alvaro-Garcia, J. M. & Maki, R. (1990) J. Immunol.
144, 3347-3352.

15. Pradines-Figueres, A. & Raetz, C. R. H. (1992) J. Biol. Chem.
267, 23261-23268.

16. Shapiro, S. D., Kobayashi, D. L. & Ley, T. J. (1993) J. Biol. Chem.
268, 23824-23829.

17. Shipley, M. J., Wesselschmidt, R. L., Kobayashi, D. K., Ley, T. J.
& Shapiro, S. D. (1996) Proc. Natl. Acad. Sci. USA 93, 3942-3946.

18. Cerreti, D. P., Kozlosky, C. J., Mosley, B., Nelson, N., Van Ness, K.,
Greenstreet, T. A., March, C. J., Kronheim, S. R., Druck, T., Cannizaro,
L. A., Huebner, K. & Black, R. A. (1992) Science 256, 97-100.

19. Miura, M., Zhu, H., Rotello, R., Hartweig, E. A. & Yuan, J.
(1993) Cell 75, 653-660.

20. Arai, K., Lee, F., Miyajima, A., Shoichiro, M., Arai, N. & Takashi,
Y. (1990) Annu. Rev. Biochem. 59, 783-836.

21. Geiser, T., Dewald, B., Ehrengruber, M. U., Lewis, I. C. &
Baggiolini, M. (1993) J. Biol. Chem. 268, 15419-15424.

22. Unemori, E. N., Amento, E. P., Bauer, E. A. & Horuk, R. (1993)
J. Biol. Chem. 268, 1338-1342.

23. Robinson, E., Keystone, E. C., Schall, T. J., Gillet, N. & Fish,
E. N. (1995) Clin. Exp. Immunol. 101, 398-407.

24. Roeser, H. (1980) in Iron Metabolism in Biochemistry and Medicine,
eds. Jacobs, A. & Worwood, M. (Academic, New York),
Vol. 2, pp. 605-640.

25. Kwak, E. L., Larochelle, D. A., Beaumont, C., Torti, S. V. &
Torti, F. M. (1995) J. Biol. Chem. 270, 15285-15293.



WORLD INTELLECTUAL PROPERTY ORGANIZATION 

International Bureau 




Reference 7 of 20 
with Response dated 05/04/04 
In USSN: 09/857,826 



PCT

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification:
C12Q 1/68, C07H 21/04

A1

(11) International Publication Number: WO 97/13877

(43) International Publication Date: 17 April 1997 (17.04.97)

(21) International Application Number: PCT/US96/16342

(22) International Filing Date: 11 October 1996 (11.10.96)



(30) Priority Data:
PCT/US95/12791 12 October 1995 (12.10.95) WO
(34) Countries for which the regional or international application was filed: US et al.
PCT/US96/09513 6 June 1996 (06.06.96) WO
(34) Countries for which the regional or international application was filed: US et al.



(60) Parent Application or Grant
(63) Related by Continuation
US Not furnished (CIP)
Filed on



(71) Applicant (for all designated States except US): LYNX THERAPEUTICS,
INC. [US/US]; 3832 Bay Center Place, Hayward,
CA 94545 (US).

(72) Inventor; and

(75) Inventor/Applicant (for US only): MARTIN, David, W.
[US/US]; Lynx Therapeutics, Inc., 3832 Bay Center Place,
Hayward, CA 94545 (US).

(74) Agent: POWERS, Vincent, M.; Dehlinger & Associates, Post
Office Box 60850, Palo Alto, CA 94306-0850 (US).



(81) Designated States: AU, CA, CZ, EE, FI, HU, JP, KR, LT, LV,
NO, NZ, PL, RU, SG, US, European patent (AT, BE, CH,
DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT,
SE).



Published

With international search report. 

Before the expiration of the time limit for amending the 
claims and to be republished in the event of the receipt of 
amendments. 



(54) Title: MEASUREMENT OF GENE EXPRESSION PROFILES IN TOXICITY DETERMINATION
(57) Abstract

A method is provided for assessing the toxicity of a compound in a test organism by measuring gene expression profiles of selected
tissues. Gene expression profiles are measured by massively parallel signature sequencing of cDNA libraries constructed from mRNA
extracted from the selected tissues. Gene expression profiles provide extensive information on the effects of administering a compound to
a test organism in both acute toxicity tests and in prolonged and chronic toxicity tests.



FOR THE PURPOSES OF INFORMATION ONLY 



Codes used to identify States party to the PCT on the front pages of pamphlets publishing international 
applications under the PCT. 



AM  Armenia
AT  Austria
AU  Australia
BB  Barbados
BE  Belgium
BF  Burkina Faso
BG  Bulgaria
BJ  Benin
BR  Brazil
BY  Belarus
CA  Canada
CF  Central African Republic
CG  Congo
CH  Switzerland
CI  Côte d'Ivoire
CM  Cameroon
CN  China
CS  Czechoslovakia
CZ  Czech Republic
DE  Germany
DK  Denmark
EE  Estonia
ES  Spain
FI  Finland
FR  France
GA  Gabon
GB  United Kingdom
GE  Georgia
GN  Guinea
GR  Greece
HU  Hungary
IE  Ireland
IT  Italy
JP  Japan
KE  Kenya
KG  Kyrgystan
KP  Democratic People's Republic of Korea
KR  Republic of Korea
KZ  Kazakhstan
LI  Liechtenstein
LK  Sri Lanka
LR  Liberia
LT  Lithuania
LU  Luxembourg
LV  Latvia
MC  Monaco
MD  Republic of Moldova
MG  Madagascar
ML  Mali
MN  Mongolia
MR  Mauritania
MW  Malawi
MX  Mexico
NE  Niger
NL  Netherlands
NO  Norway
NZ  New Zealand
PL  Poland
PT  Portugal
RO  Romania
RU  Russian Federation
SD  Sudan
SE  Sweden
SG  Singapore
SI  Slovenia
SK  Slovakia
SN  Senegal
SZ  Swaziland
TD  Chad
TG  Togo
TJ  Tajikistan
TT  Trinidad and Tobago
UA  Ukraine
UG  Uganda
US  United States of America
UZ  Uzbekistan
VN  Viet Nam



WO 97/13877 PCT/US96/16342



MEASUREMENT OF GENE EXPRESSION PROFILES 
IN TOXICITY DETERMINATION 

Field of the Invention

The invention relates generally to methods for detecting and monitoring
phenotypic changes in in vitro and in vivo systems for assessing and/or determining
the toxicity of chemical compounds, and more particularly, the invention relates to a
method for detecting and monitoring changes in gene expression patterns in in vitro
and in vivo systems for determining the toxicity of drug candidates.

BACKGROUND 

The ability to rapidly and conveniently assess the toxicity of new compounds
is extremely important. Thousands of new compounds are synthesized every year,
and many are introduced [...] the development of new
commercial products and processes, often with little knowledge of their short term
and long term health effects. In the development of new drugs, the cost of assessing
the safety and efficacy of candidate compounds is becoming astronomical: it is
estimated that the pharmaceutical industry spends an average of about 300 million
dollars to bring a new pharmaceutical compound to market, e.g. Biotechnology, 13:
226-228 (1995). A large fraction of these costs is due to the failure of candidate
compounds in the later stages of the developmental process. That is, as the
assessment of a candidate drug progresses from the identification of a compound as a
drug candidate (for example, through relatively inexpensive binding assays or in vitro
screening assays) to pharmacokinetic studies, to toxicity studies, to efficacy studies in
model systems, to preliminary clinical studies, and so on, the costs of the associated
tests and analyses increase tremendously. Consequently, it may cost several tens of
millions of dollars to determine that a once promising candidate compound possesses
a side effect or cross reactivity that renders it commercially infeasible to develop
further. A great challenge of pharmaceutical development is to remove from further
consideration as early as possible those compounds that are likely to fail in the later
stages of drug testing.

Drug development programs are clearly structured with this objective in mind;
however, rapidly escalating costs have created a need to develop even more stringent
and less expensive screens in the early stages to identify false leads as soon as
possible. Toxicity assessment is an area where such improvements may be made, for
both drug development and for assessing the environmental, health, and safety effects
of new compounds in general.






Typically the toxicity of a compound is determined by administering the 
compound to one or more species of test animal under controlled conditions and by 
monitoring the effects on a wide range of parameters. The parameters include such 
things as blood chemistry, weight gain or loss, a variety of behavioral patterns, muscle 
tone, body temperature, respiration rate, lethality, and the like, which collectively 
provide a measure of the state of health of the test animal. The degree of deviation of 
such parameters from their normal ranges gives a measure of the toxicity of a 
compound. Such tests may be designed to assess the acute, prolonged, or chronic 
toxicity of a compound. In general, acute tests involve administration of the test 
chemical on one occasion. The period of observation of the test animals may be as 
short as a few hours, although it is usually at least 24 hours and in some cases it may 
be as long as a week or more. In general, prolonged tests involve administration of 
the test chemical on multiple occasions. The test chemical may be administered one 
or more times each day, irregularly as when it is incorporated in the diet, at specific
times such as during pre[...]
intervals. Also, in the prolonged test the experiment is usually conducted for not less
than 90 days in the rat or mouse or a year in the dog. In contrast to the acute and
prolonged types of test, the chronic toxicity tests are those in which the test chemical
is administered for a substantial portion of the lifetime of the test animal. In the case
of the mouse or rat, this is a period of 2 to 3 years. In the case of the dog, it is for 5 to 
7 years. 

Significant costs are incurred in establishing and maintaining large cohorts of 
test animals for such assays, especially the larger animals in chronic toxicity assays. 
Moreover, because of species specific effects, passing such toxicity tests does not 
ensure that a compound is free of toxic effects when used in humans. Such tests do, 
however, provide a standardized set of information for judging the safety of new
compounds, and they provide a database for giving preliminary assessments of related 
compounds. An important area for improving toxicity determination would be the 
identification of new observables which are predictive of the outcome of the 
expensive and tedious animal assays. 

In other medical fields, there has been significant interest in applying recent 
advances in biotechnology, particularly in DNA sequencing, to the identification and 
study of differentially expressed genes in healthy and diseased organisms, e.g. Adams 
et al. Science, 252: 1651-1656 (1991); Matsubara et al. Gene, 135: 265-274 (1993); 
Rosenberg et al. International patent application, PCT/US95/01863. The objectives
of such applications include increasing our knowledge of disease processes, 
identifying genes that play important roles in the disease process, and providing 
diagnostic and therapeutic approaches that exploit the expressed genes or their 






products. While such approaches are attractive, those based on exhaustive, or even
sampled, sequencing of expressed genes are still beset by the enormous effort
required: it is estimated that 30-35 thousand different genes are expressed in a typical
mammalian tissue in any given state, e.g. Ausubel et al. Editors, Current Protocols,
5.8.1-5.8.4 (John Wiley & Sons, New York, 1992). Determining the sequences of
even a small sample of that number of gene products is a major enterprise, requiring
industrial-scale resources. Thus, the routine application of massive sequencing of
expressed genes is still beyond current commercial technology.

The availability of new assays for assessing the toxicity of compounds, such 
as candidate drugs, that would provide more comprehensive and precise information
about the state of health of a test animal would be highly desirable. Such additional 
assays would preferably be less expensive, more rapid, and more convenient than 
current testing procedures, and would at the same time provide enough information to 
make early judgments regarding the safety of new compounds. 


Summary of the Invention 
An object of the invention is to provide a new approach to toxicity assessment 
based on an examination of gene expression patterns, or profiles, in in vitro or in vivo 
test systems.

Another object of the invention is to provide a database on which to base
decisions concerning the toxicological properties of chemicals, particularly drug
candidates.

A further object of the invention is to provide a method for analyzing gene 
expression patterns in selected tissues of test animals. 
A still further object of the invention is to provide a system for identifying
genes which are differentially expressed in response to exposure to a test compound.

Another object of the invention is to provide a rapid and reliable method for 
correlating gene expression with short term and long term toxicity in test animals. 

Another object of the invention is to identify genes whose expression is 
predictive of deleterious toxicity.

The invention achieves these and other objects by providing a method for 
massively parallel signature sequencing of genes expressed in one or more selected 
tissues of an organism exposed to a test compound. An important feature of the 
invention is the application of novel DNA sorting and sequencing methodologies that 
permit the formation of gene expression profiles for selected tissues by determining
the sequence of portions of many thousands of different polynucleotides in parallel.
Such profiles may be compared with those from tissues of control organisms at single
or multiple time points to identify expression patterns predictive of toxicity.



The sorting methodology of the invention makes use of oligonucleotide tags 
that are members of a minimally cross-hybridizing set of oligonucleotides. The 
sequences of oligonucleotides of such a set differ from the sequences of every other 
member of the same set by at least two nucleotides. Thus, each member of such a set 
cannot form a duplex (or triplex) with the complement of any other member with less 
than two mismatches. Complements of oligonucleotide tags of the invention, referred 
to herein as "tag complements," may comprise natural nucleotides or non-natural 
nucleotide analogs. Preferably, tag complements are attached to solid phase supports. 
Such oligonucleotide tags when used with their corresponding tag complements 
provide a means of enhancing specificity of hybridization for sorting polynucleotides, 
such as cDNAs. 

The polynucleotides to be sorted each have an oligonucleotide tag attached, 
such that different polynucleotides have different tags. As explained more fully 
below, this condition is achieved by employing a repertoire of tags substantially
greater than the population of polynucleotides and by taking a sufficiently small
sample of tagged polynucleotides from the full ensemble of tagged polynucleotides.
After such sampling, when the populations of supports and polynucleotides are mixed
under conditions which permit specific hybridization of the oligonucleotide tags with
their respective complements, identical polynucleotides sort onto particular beads or
regions. The sorted populations of polynucleotides can then be sequenced on the
solid phase support by a "single-base" or "base-by-base" sequencing methodology, as
described more fully below.

In one aspect, the method of the invention comprises the following steps: (a) 
administering the compound to a test organism; (b) extracting a population of mRNA 
molecules from each of one or more tissues of the test organism; (c) forming a 
separate population of cDNA molecules from each population of mRNA molecules 
extracted from the one or more tissues such that each cDNA molecule of the separate 
populations has an oligonucleotide tag attached, the oligonucleotide tags being 
selected from the same minimally cross-hybridizing set; (d) separately sampling each 
population of cDNA molecules such that substantially all different cDNA molecules 
within a separate population have different oligonucleotide tags attached; (e) sorting 
the cDNA molecules of each separate population by specifically hybridizing the 
oligonucleotide tags with their respective complements, the respective complements 
being attached as uniform populations of substantially identical complements in 
spatially discrete regions on one or more solid phase supports; (f) determining the
nucleotide sequence of a portion of each of the sorted cDNA molecules of each 
separate population to form a frequency distribution of expressed genes for each of
the one or more tissues; and (g) correlating the frequency distribution of expressed
genes in each of the one or more tissues with the toxicity of the compound. 

An important aspect of the invention is the identification of genes whose 
expression is predictive of the toxicity of a compound. Once such genes are 
identified, they may be employed in conventional assays, such as reverse transcriptase
polymerase chain reaction (RT-PCR) assays for gene expression. 

Brief Description of the Drawings 
Figure 1 is a flow chart representation of an algorithm for generating 
minimally cross-hybridizing sets of oligonucleotides.

Figure 2 diagrammatically illustrates an apparatus for carrying out
polynucleotide sequencing in accordance with the invention. 

Definitions 

"Complement" or "tag complement" as used herein in reference to
oligonucleotide tags refers to an oligonucleotide to which an oligonucleotide tag
specifically hybridizes to form a perfectly matched duplex or triplex. In embodiments
where specific hybridization results in a triplex, the oligonucleotide tag may be
selected to be either double stranded or single stranded. Thus, where triplexes are
formed, the term "complement" is meant to encompass either a double stranded
complement of a single stranded oligonucleotide tag or a single stranded complement
of a double stranded oligonucleotide tag.

The term "oligonucleotide" as used herein includes linear oligomers of natural
or modified monomers or linkages, including deoxyribonucleosides, ribonucleosides,
anomeric forms thereof, peptide nucleic acids (PNAs), and the like, capable of
specifically binding to a target polynucleotide by way of a regular pattern of
monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base
stacking, Hoogsteen or reverse Hoogsteen types of base pairing, or the like. Usually
monomers are linked by phosphodiester bonds or analogs thereof to form
oligonucleotides ranging in size from a few monomeric units, e.g. 3-4, to several tens
of monomeric units. Whenever an oligonucleotide is represented by a sequence of
letters, such as "ATGCCTG," it will be understood that the nucleotides are in 5'->3'
order from left to right and that "A" denotes deoxyadenosine, "C" denotes
deoxycytidine, "G" denotes deoxyguanosine, and "T" denotes thymidine, unless
otherwise noted. Analogs of phosphodiester linkages include phosphorothioate,
phosphorodithioate, phosphoranilidate, phosphoramidate, and the like. Usually
oligonucleotides of the invention comprise the four natural nucleotides; however, they
may also comprise non-natural nucleotide analogs. It is clear to those skilled in the
art when oligonucleotides having natural or non-natural nucleotides may be
employed, e.g. where processing by enzymes is called for, usually oligonucleotides
consisting of natural nucleotides are required.

"Perfectly matched" in reference to a duplex means that the poly- or 
oligonucleotide strands making up the duplex form a double stranded structure with
one another such that every nucleotide in each strand undergoes Watson-Crick
basepairing with a nucleotide in the other strand. The term also comprehends the
pairing of nucleoside analogs, such as deoxyinosine, nucleosides with 2-aminopurine
bases, and the like, that may be employed. In reference to a triplex, the term means
that the triplex consists of a perfectly matched duplex and a third strand in which
every nucleotide undergoes Hoogsteen or reverse Hoogsteen association with a
basepair of the perfectly matched duplex. Conversely, a "mismatch" in a duplex
between a tag and an oligonucleotide means that a pair or triplet of nucleotides in the
duplex or triplex fails to undergo Watson-Crick and/or Hoogsteen and/or reverse
Hoogsteen bonding.

As used herein, "nucleoside" includes the natural nucleosides, including 2'-
deoxy and 2'-hydroxyl forms, e.g. as described in Kornberg and Baker, DNA
Replication, 2nd Ed. (Freeman, San Francisco, 1992). "Analogs" in reference to
nucleosides includes synthetic nucleosides having modified base moieties and/or
modified sugar moieties, e.g. described by Scheit, Nucleotide Analogs (John Wiley,
New York, 1980); Uhlman and Peyman, Chemical Reviews, 90: 543-584 (1990); or
the like, with the only proviso that they are capable of specific hybridization. Such 
analogs include synthetic nucleosides designed to enhance binding properties, reduce 
complexity, increase specificity, and the like. 

As used herein "sequence determination" or "determining a nucleotide
sequence" in reference to polynucleotides includes determination of partial as well as
full sequence information of the polynucleotide. That is, the term includes sequence
comparisons, fingerprinting, and like levels of information about a target
polynucleotide, as well as the express identification and ordering of nucleosides,
usually each nucleoside, in a target polynucleotide. The term also includes the
determination of the identification, ordering, and locations of one, two, or three of the
four types of nucleotides within a target polynucleotide. For example, in some
embodiments sequence determination may be effected by identifying the ordering and
locations of a single type of nucleotide, e.g. cytosines, within the target polynucleotide
"CATCGC ..." so that its sequence is represented as a binary code, e.g. "100101 ... " for
"C-(not C)-(not C)-C-(not C)-C ... " and the like.



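The single-nucleotide-type encoding described in the definition of "sequence determination" can be illustrated with a minimal Python sketch. The function name binary_signature is hypothetical and is not part of the disclosure; it simply marks each position as "1" where the chosen nucleotide occurs and "0" elsewhere.

```python
def binary_signature(sequence, base="C"):
    """Encode a sequence as 1 where `base` occurs and 0 elsewhere.

    Illustrative helper only; the name and signature are assumptions.
    """
    return "".join("1" if nt == base else "0" for nt in sequence.upper())


# The example from the text: cytosines within "CATCGC"
print(binary_signature("CATCGC"))  # 100101
```

The same helper may be applied with a different base argument to record the positions of any one of the four nucleotide types.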



As used herein, the term "complexity" in reference to a population of 
polynucleotides means the number of different species of molecule present in the
population. 

As used herein, the terms "gene expression profile" and "gene expression
pattern," which are used equivalently, mean a frequency distribution of sequences of
portions of cDNA molecules sampled from a population of tag-cDNA conjugates.
Generally, the portions of sequence are sufficiently long to uniquely identify the
cDNA from which the portion arose. Preferably, the total number of sequences
determined is at least 1000; more preferably, the total number of sequences
determined in a gene expression profile is at least ten thousand.

As used herein, "test organism" means any in vitro or in vivo system which
provides measurable responses to exposure to test compounds. Typically, test
organisms may be mammalian cell cultures, particularly of specific tissues, such as
hepatocytes, neurons, kidney cells, colony forming cells, or the like, or test organisms
may be whole animals, such as rats, mice, hamsters, guinea pigs, dogs, cats, rabbits,
pigs, monkeys, and the like.

Detailed Description of the Invention 
The invention provides a method for determining the toxicity of a compound
by analyzing changes in the gene expression profiles in selected tissues of test
organisms exposed to the compound. The invention also provides a method of
identifying toxicity markers consisting of individual genes or a group of genes that is
expressed acutely and which is correlated with prolonged or chronic toxicity, or
suggests that the compound will have an undesirable cross reactivity. Gene
expression profiles are generated by sequencing portions of cDNA molecules
constructed from mRNA extracted from tissues of test organisms exposed to the
compound being tested. As used herein, the term "tissue" is employed with its usual
medical or biological meaning, except that in reference to an in vitro test system, such
as a cell culture, it simply means a sample from the culture. Gene expression profiles
derived from test organisms are compared to gene expression profiles derived from
control organisms to determine the genes which are differentially expressed in the test
organism because of exposure to the compound being tested. In both cases, the
sequence information of the gene expression profiles is obtained by massively parallel
signature sequencing of cDNAs, which is implemented in steps (c) through (f) of the
above method.
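By way of illustration only, the comparison of a test profile with a control profile may be sketched in Python as follows. The function names, the pseudo-frequency, the two-fold threshold, and the signature sequences are all hypothetical choices made for this sketch; the disclosure does not prescribe a particular statistic.

```python
from collections import Counter
from math import log2


def expression_profile(signatures):
    """Frequency distribution of signature sequences (relative frequencies)."""
    counts = Counter(signatures)
    total = sum(counts.values())
    return {sig: n / total for sig, n in counts.items()}


def differential_expression(test, control, pseudo=1e-4, min_log2=1.0):
    """Signatures whose log2(test/control) is at least min_log2 in magnitude.

    `pseudo` avoids division by zero for signatures absent from one profile.
    Both parameters are assumed values for illustration.
    """
    changed = {}
    for sig in set(test) | set(control):
        ratio = log2((test.get(sig, 0.0) + pseudo) / (control.get(sig, 0.0) + pseudo))
        if abs(ratio) >= min_log2:
            changed[sig] = ratio
    return changed


# Made-up signatures standing in for sorted, sequenced cDNA portions
control = expression_profile(["GATTACAGG"] * 50 + ["CCTGAAGTC"] * 50)
treated = expression_profile(
    ["GATTACAGG"] * 20 + ["CCTGAAGTC"] * 50 + ["TTGACCAGT"] * 30
)
changed = differential_expression(treated, control)
```

In this toy example the first signature is reported as reduced, the third as induced, and the second, whose frequency is unchanged, is not reported.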

Toxicity Assessment 
Procedures for designing and conducting toxicity tests in in vitro and in vivo
systems are well known, and are described in many texts on the subject, such as Loomis
et al. Loomis's Essentials of Toxicology, 4th Ed. (Academic Press, New York, 1996);
Ecobichon, The Basics of Toxicity Testing (CRC Press, Boca Raton, 1992); Frazier,
editor, In Vitro Toxicity Testing (Marcel Dekker, New York, 1992); and the like.

In toxicity testing, two groups of test organisms are usually employed: one 
group serves as a control and the other group receives the test compound in a single 
dose (for acute toxicity tests) or a regimen of doses (for prolonged or chronic toxicity 
tests). Since in most cases, the extraction of tissue as called for in the method of the 
invention requires sacrificing the test animal, both the control group and the group 
receiving compound must be large enough to permit removal of animals for sampling 
tissues, if it is desired to observe the dynamics of gene expression through the 
duration of an experiment. 

In setting up a toxicity study, extensive guidance is provided in the literature 
for selecting the appropriate test organism for the compound being tested, route of 
administration, dose ranges, and the like. Water or physiological saline (0.9% NaCl
in water) is the solvent of choice for the test compound since these solvents permit
administration by a variety of routes. When this is not possible because of solubility
limitations, it is necessary to resort to the use of vegetable oils such as corn oil or
even organic solvents, of which propylene glycol is commonly used. Whenever
possible the use of suspensions or emulsions should be avoided except for oral
administration. Regardless of the route of administration, the volume required to
administer a given dose is limited by the size of the animal that is used. It is desirable
to keep the volume of each dose uniform within and between groups of animals.
When rats or mice are used the volume administered by the oral route should not
exceed 0.005 ml per gram of animal. Even when aqueous or physiological saline 
solutions are used for parenteral injection the volumes that are tolerated are limited, 
although such solutions are ordinarily thought of as being innocuous. The 
intravenous LD50 of distilled water in the mouse is approximately 0.044 ml per gram 
and that of isotonic saline is 0.068 ml per gram of mouse. 
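The oral volume limit stated above lends itself to a simple check. The following Python sketch is illustrative only: the function names and the example dose, body mass, and concentration are assumptions, and only the figure of 0.005 ml per gram is taken from the text.

```python
MAX_ORAL_ML_PER_G = 0.005  # oral volume limit for rats or mice, from the text


def max_oral_volume_ml(body_mass_g):
    """Largest oral dose volume (ml) for an animal of the given mass."""
    return MAX_ORAL_ML_PER_G * body_mass_g


def dose_volume_ml(dose_mg_per_kg, body_mass_g, conc_mg_per_ml):
    """Volume (ml) needed to deliver a dose at a given solution concentration."""
    dose_mg = dose_mg_per_kg * body_mass_g / 1000.0
    return dose_mg / conc_mg_per_ml


def fits_oral_limit(dose_mg_per_kg, body_mass_g, conc_mg_per_ml):
    """True if the required volume does not exceed the oral limit."""
    required = dose_volume_ml(dose_mg_per_kg, body_mass_g, conc_mg_per_ml)
    return required <= max_oral_volume_ml(body_mass_g)


# Hypothetical example: 50 mg/kg to a 25 g mouse from a 20 mg/ml solution
volume = dose_volume_ml(50, 25, 20)   # about 0.0625 ml
limit = max_oral_volume_ml(25)        # about 0.125 ml
ok = fits_oral_limit(50, 25, 20)
```

If the required volume exceeds the limit, a more concentrated solution or a different vehicle would have to be considered.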

When a compound is to be administered by inhalation, special techniques for 
generating test atmospheres are necessary. Dose estimation becomes very 
complicated. The methods usually involve aerosolization or nebulization of fluids 
containing the compound. If the agent to be tested is a fluid that has an appreciable 
vapor pressure, it may be administered by passing air through the solution under 
controlled temperature conditions. Under these conditions, dose is estimated from the 
volume of air inhaled per unit time, the temperature of the solution, and the vapor 
pressure of the agent involved. Gases are metered from reservoirs. When particles of 
a solution are to be administered, unless the particle size is less than about 2 µm the
particles will not reach the terminal alveolar sacs in the lungs. A variety of
apparatuses and chambers are available to perform studies for detecting effects of
irritant or other toxic endpoints when they are administered by inhalation. The
preferred method of administering an agent to animals is via the oral route, either by
intubation or by incorporating the agent in the feed.

Preferably, in designing a toxicity assessment, two or more species should be
employed that handle the test compound as similarly to man as possible in terms of
metabolism, absorption, excretion, tissue storage, and the like. Preferably, multiple
doses or regimens at different concentrations should be employed to establish a dose-
response relationship with respect to toxic effects. And preferably, the route of
administration to the test animal should be the same as, or as similar as possible to,
the route of administration of the compound to man. Effects obtained by one route of
administration to test animals are not a priori applicable to effects by another route of
administration to man. For example, food additives for man should be tested by
admixture of the material in the diet of the test animals.

Acute toxicity tests consist of administering a compound to test organisms on
one occasion. The purpose of such tests is to determine the symptomatology
consequent to administration of the compound and to determine the degree of lethality
of the compound. The initial procedure is to perform a series of range-finding doses
of the compound in a single species. This necessitates selection of a route of
administration, preparation of the compound in a form suitable for administration by
the selected route, and selection of an appropriate species. Preferably, initial acute
toxicity studies are performed on either rats or mice because of their low cost, their
availability, and the availability of abundant toxicologic reference data on these
species. Prolonged toxicity tests consist of administering a compound to test
organisms repeatedly, usually on a daily basis, over a period of 3 to 4 months. Two
practical factors are encountered that place constraints on the design of such tests:
First, the available routes of administration are limited because the route selected
must be suitable for repeated administration without inducing harmful effects. And
second, blood, urine, and perhaps other samples, should be taken repeatedly without
inducing significant harm to the test animals. Preferably, in the method of the
invention the gene expression profiles are obtained in conjunction with the
measurement of the traditional toxicologic parameters, such as listed in the table
below:



Hematology                    Blood chemistry                              Urine analyses

erythrocyte count             calcium                                      pH
total leukocyte count         carbon dioxide                               specific gravity
differential leukocyte count  serum glutamic-pyruvic transaminase          total protein
hematocrit                    serum glutamic-oxalacetic transaminase       sediment
hemoglobin                    serum protein electrophoresis                glucose
                              blood sugar                                  ketones
                              blood urea nitrogen                          bilirubin
                              total serum protein
                              serum albumin
                              total serum bilirubin
                              sodium
                              potassium
                              chloride



Oligonucleotide Tags and Tag Complements



Oligonucleotide tags are members of a minimally cross-hybridizing set of 
oligonucleotides. The sequences of oligonucleotides of such a set differ from the 
sequences of every other member of the same set by at least two nucleotides. Thus, 
each member of such a set cannot form a duplex (or triplex) with the complement of 
any other member with less than two mismatches. Complements of oligonucleotide 
tags, referred to herein as "tag complements," may comprise natural nucleotides or
non-natural nucleotide analogs. Preferably, tag complements are attached to solid 
phase supports. Such oligonucleotide tags when used with their corresponding tag 
complements provide a means of enhancing specificity of hybridization for sorting, 
tracking, or labeling molecules, especially polynucleotides. 

Minimally cross-hybridizing sets of oligonucleotide tags and tag complements
may be synthesized either combinatorially or individually depending on the size of the
set desired and the degree to which cross-hybridization is sought to be minimized (or
stated another way, the degree to which specificity is sought to be enhanced). For
example, a minimally cross-hybridizing set may consist of a set of individually
synthesized 10-mer sequences that differ from each other by at least 4 nucleotides,
such set having a maximum size of 332 (when composed of 3 kinds of nucleotides
and counted using a computer program such as disclosed in Appendix Ic).
Alternatively, a minimally cross-hybridizing set of oligonucleotide tags may also be
assembled combinatorially from subunits which themselves are selected from a
minimally cross-hybridizing set. For example, a set of minimally cross-hybridizing
12-mers differing from one another by at least three nucleotides may be synthesized
by assembling 3 subunits selected from a set of minimally cross-hybridizing 4-mers
that each differ from one another by three nucleotides. Such an embodiment gives a
maximally sized set of 9^3, or 729, 12-mers. The number 9 is the number of
oligonucleotides listed by the computer program of Appendix Ia, which assumes, as
with the 10-mers, that only 3 of the 4 different types of nucleotides are used. The set
is described as "maximal" because the computer programs of Appendices Ia-c provide
the largest set for a given input (e.g. length, composition, difference in number of
nucleotides between members). Additional minimally cross-hybridizing sets may be
formed from subsets of such calculated sets.

Oligonucleotide tags may be single stranded and be designed for specific
hybridization to single stranded tag complements by duplex formation or for specific
hybridization to double stranded tag complements by triplex formation.
Oligonucleotide tags may also be double stranded and be designed for specific
hybridization to single stranded tag complements by triplex formation.

When synthesized combinatorially, an oligonucleotide tag preferably consists
of a plurality of subunits, each subunit consisting of an oligonucleotide of 3 to 9
nucleotides in length wherein each subunit is selected from the same minimally cross-
hybridizing set. In such embodiments, the number of oligonucleotide tags available
depends on the number of subunits per tag and on the length of the subunits. The
number is generally much less than the number of all possible sequences the length of
the tag, which for a tag n nucleotides long would be 4^n.
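The relationship between repertoire size and the total number of possible sequences can be made concrete with a short calculation. The Python sketch below is illustrative; the function names are hypothetical, and the set size of 9 for 4-mer subunits is the figure given in the text for three kinds of nucleotides differing by three bases.

```python
def repertoire_size(set_size, subunits_per_tag):
    """Number of tags assembled from subunits of one minimally
    cross-hybridizing set: set_size ** subunits_per_tag."""
    return set_size ** subunits_per_tag


def all_possible_sequences(tag_length):
    """Number of all sequences of the given length over four nucleotides."""
    return 4 ** tag_length


# Three 4-mer subunits from a set of 9 give 12-mer tags
tags_12mer = repertoire_size(9, 3)           # 729
possible_12mer = all_possible_sequences(12)  # 16777216

# Four and five subunits from the same set
tags_16mer = repertoire_size(9, 4)           # 6561
tags_20mer = repertoire_size(9, 5)           # 59049
```

The repertoire of 729 12-mers is thus a very small fraction of the 4^12 sequences of the same length, which is the price paid for guaranteed mismatches between tags.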

Complements of oligonucleotide tags attached to a solid phase support are
used to sort polynucleotides from a mixture of polynucleotides each containing a tag.
Complements of the oligonucleotide tags are synthesized on the surface of a solid
phase support, such as a microscopic bead or a specific location on an array of
synthesis locations on a single support, such that populations of identical sequences
are produced in specific regions. That is, the surface of each support, in the case of a
bead, or of each region, in the case of an array, is derivatized by only one type of
complement which has a particular sequence. The population of such beads or regions
contains a repertoire of complements with distinct sequences. As used herein in
reference to oligonucleotide tags and tag complements, the term "repertoire" means
the set of minimally cross-hybridizing oligonucleotides that make up the tags in
a particular embodiment or the corresponding set of tag complements.

The polynucleotides to be sorted each have an oligonucleotide tag attached, 
such that different polynucleotides have different tags. As explained more fully
below, this condition is achieved by employing a repertoire of tags substantially
greater than the population of polynucleotides and by taking a sufficiently small 
sample of tagged polynucleotides from the full ensemble of tagged polynucleotides. 
After such sampling, when the populations of supports and polynucleotides are mixed 
under conditions which permit specific hybridization of the oligonucleotide tags with 
their respective complements, identical polynucleotides sort onto particular beads or 
regions. 

The nucleotide sequences of oligonucleotides of a minimally cross-hybridizing 
set are conveniently enumerated by simple computer programs, such as those 
exemplified by programs whose source codes are listed in Appendices Ia and Ib.
Program minhx of Appendix Ia computes all minimally cross-hybridizing sets having
4-mer subunits composed of three kinds of nucleotides. Program tagN of Appendix
Ib enumerates longer oligonucleotides of a minimally cross-hybridizing set. Similar
algorithms and computer programs are readily written for listing oligonucleotides of 
minimally cross-hybridizing sets for any embodiment of the invention. Table I below 
provides guidance as to the size of sets of minimally cross-hybridizing 
oligonucleotides for the indicated lengths and number of nucleotide differences. The 
above computer programs were used to generate the numbers. 



Table I

Oligonucleotide  Nucleotide Difference     Maximal Size     Size of          Size of
Word Length      between Oligonucleotides  of Minimally     Repertoire with  Repertoire with
                 of Minimally Cross-       Cross-           Four Words       Five Words
                 Hybridizing Set           Hybridizing Set

 4                3                         9               6561             5.90 x 10^4
 6                3                         27              5.3 x 10^5       1.43 x 10^7
 7                4                         27              5.3 x 10^5       1.43 x 10^7
 7                5                         8               4096             3.28 x 10^4
 8                3                         190             1.30 x 10^9      2.48 x 10^11
 8                4                         62              1.48 x 10^7      9.16 x 10^8
 8                5                         18              1.05 x 10^5      1.89 x 10^6
 9                5                         39              2.31 x 10^6      9.02 x 10^7
10                5                         332             1.21 x 10^10
10                6                         28              6.15 x 10^5      1.72 x 10^7
11                5                         187
18                6                         ≈25000
18               12                         24

For some embodiments of the invention, where extremely large repertoires of
tags are not required, oligonucleotide tags of a minimally cross-hybridizing set may
be separately synthesized. Sets containing several hundred to several thousands, or
even several tens of thousands, of oligonucleotides may be synthesized directly by a
variety of parallel synthesis approaches, e.g. as disclosed in Frank et al, U.S. patent
4,689,405; Frank et al, Nucleic Acids Research, 11: 4365-4377 (1983); Matson et al,
Anal. Biochem., 224: 110-116 (1995); Fodor et al, International application
PCT/US93/04145; Pease et al, Proc. Natl. Acad. Sci., 91: 5022-5026 (1994);
Southern et al, J. Biotechnology, 35: 217-227 (1994); Brennan, International
application PCT/US94/05896; Lashkari et al, Proc. Natl. Acad. Sci., 92: 7912-7915
(1995); or the like.

Preferably, oligonucleotide tags of the invention are synthesized
combinatorially out of subunits between three and six nucleotides in length and
selected from the same minimally cross-hybridizing set. For oligonucleotides in this
range, the members of such sets may be enumerated by computer programs based on
the algorithm of Fig. 1.

The algorithm of Fig. 1 is implemented by first defining the characteristics of
the subunits of the minimally cross-hybridizing set, i.e. length, number of base
differences between members, and composition, e.g. do they consist of two, three, or
four kinds of bases. A table M_n, n=1, is generated (100) that consists of all possible
sequences of a given length and composition. An initial subunit S_1 is selected and
compared (120) with successive subunits S_i for i=n+1 to the end of the table.
Whenever a successive subunit has the required number of mismatches to be a
member of the minimally cross-hybridizing set, it is saved in a new table M_{n+1} (125),
that also contains subunits previously selected in prior passes through step 120. For
example, in the first set of comparisons, M_2 will contain S_1; in the second set of
comparisons, M_3 will contain S_1 and S_2; in the third set of comparisons, M_4 will
contain S_1, S_2, and S_3; and so on. Similarly, comparisons in table M_j will be
between S_j and all successive subunits in M_j. Note that each successive table M_{n+1}
is smaller than its predecessors as subunits are eliminated in successive passes
through step 130. After every subunit of table M_n has been compared (140) the old
table is replaced by the new table M_{n+1}, and the next round of comparisons is
begun. The process stops (160) when a table M_n is reached that contains no
successive subunits to compare to the selected subunit S_i, i.e. M_n=M_{n+1}.
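One reading of the table-filtering procedure of Fig. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program of the Appendices: the function names are hypothetical, the table is ordered lexicographically over the supplied alphabet, and the result depends on that ordering, so the sketch is not guaranteed to return a set of maximal size.

```python
from itertools import combinations, product


def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def minimally_cross_hybridizing(length, alphabet, min_mismatches):
    """Filter the table of all sequences so that every retained pair
    differs in at least `min_mismatches` positions."""
    # Table M_1: all possible sequences of the given length and composition
    table = ["".join(p) for p in product(alphabet, repeat=length)]
    n = 0  # index of the currently selected subunit
    while n < len(table):
        selected = table[n]
        # New table: subunits selected so far, plus every later subunit
        # that has the required number of mismatches with the selected one
        table = table[: n + 1] + [
            s for s in table[n + 1:] if mismatches(selected, s) >= min_mismatches
        ]
        n += 1
    return table


# 4-mer subunits built from three kinds of nucleotides, at least 3 mismatches
subunits = minimally_cross_hybridizing(4, "AGT", 3)
smallest = min(mismatches(a, b) for a, b in combinations(subunits, 2))
```

Each pass shortens the table, and the loop ends when no successive subunits remain to be compared, which corresponds to the stopping condition M_n=M_{n+1}.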

Preferably, minimally cross-hybridizing sets comprise subunits that make 
approximately equivalent contributions to duplex stability as every other subunit in
the set. In this way, the stability of perfectly matched duplexes between every subunit
and its complement is approximately equal. Guidance for selecting such sets is 
provided by published techniques for selecting optimal PCR primers and calculating 
duplex stabilities, e.g. Rychlik et al, Nucleic Acids Research, 17: 8543-8551 (1989) 
and 18: 6409-6412 (1990); Breslauer et al, Proc. Natl. Acad. Sci., 83: 3746-3750
(1986); Wetmur, Crit. Rev. Biochem. Mol. Biol., 26: 227-259 (1991); and the like.
For shorter tags, e.g. about 30 nucleotides or less, the algorithm described by Rychlik
and Wetmur is preferred, and for longer tags, e.g. about 30-35 nucleotides or greater,
an algorithm disclosed by Suggs et al. pages 683-693 in Brown, editor, ICN-UCLA
Symp. Dev. Biol., Vol. 23 (Academic Press, New York, 1981) may be conveniently
employed. Clearly, there are many approaches available to one skilled in the art for
designing sets of minimally cross-hybridizing subunits within the scope of the
invention. For example, to minimize the effects of different base-stacking energies of
terminal nucleotides when subunits are assembled, subunits may be provided that 
have the same terminal nucleotides. In this way, when subunits are linked, the sum of 
the base-stacking energies of all the adjoining terminal nucleotides will be the same, 
thereby reducing or eliminating variability in tag melting temperatures. 

A "word" of terminal nucleotides, shown in italic below, may also be added to 
each end of a tag so that a perfect match is always formed between it and a similar
terminal "word" on any other tag complement. Such an augmented tag would have 
the form: 



W    W_1    W_2    ...    W_{k-1}    W_k    W
W'   W_1'   W_2'   ...    W_{k-1}'   W_k'   W'



where the primed W's indicate complements. With ends of tags always forming 
perfectly matched duplexes, all mismatched words will be internal mismatches 
thereby reducing the stability of tag-complement duplexes that otherwise would have 
mismatched words at their ends. It is well known that duplexes with internal
mismatches are significantly less stable than duplexes with the same mismatch at a 
terminus. 

A preferred embodiment of minimally cross-hybridizing sets are those whose 
subunits are made up of three of the four natural nucleotides. As will be discussed 
more fully below, the absence of one type of nucleotide in the oligonucleotide tags 
permits target polynucleotides to be loaded onto solid phase supports by use of the 
3'->5' exonuclease activity of a DNA polymerase. The following is an exemplary
minimally cross-hybridizing set of subunits each comprising four nucleotides selected 
from the group consisting of A, G, and T: 






Table II

Word:       W1     W2     W3     W4
Sequence:   GATT   TGAT   TAGA   TTTG

Word:       W5     W6     W7     W8
Sequence:   GTAA   AGTA   ATGT   AAAG



In this set, each member would form a duplex having three mismatched bases with
the complement of every other member.
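
The three-mismatch property stated for Table II can be checked mechanically. The following is a minimal sketch, not part of the disclosed program of the appendices: it counts position-by-position differences between every pair of words, which equals the number of mismatched bases when one word is paired with the complement of the other. The names `TABLE_II`, `mismatches`, and `min_pairwise_mismatches` are illustrative, and W6 is taken here as AGTA on the assumption that every word is drawn from A, G, and T.

```python
from itertools import combinations

# Words W1..W8 of Table II (W6 assumed to be AGTA; see lead-in).
TABLE_II = ["GATT", "TGAT", "TAGA", "TTTG", "GTAA", "AGTA", "ATGT", "AAAG"]


def mismatches(a, b):
    """Count positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_mismatches(words):
    """Smallest mismatch count over all distinct pairs of words."""
    return min(mismatches(a, b) for a, b in combinations(words, 2))


if __name__ == "__main__":
    print(min_pairwise_mismatches(TABLE_II))
```

A set passes as minimally cross-hybridizing at a chosen threshold when the value returned is at least that threshold.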

Further exemplary minimally cross-hybridizing sets are listed below in Table 
III. Clearly, additional sets can be generated by substituting different groups of 
nucleotides, or by using subsets of known minimally cross-hybridizing sets.



Table III

Exemplary Minimally Cross-Hybridizing Sets of 4-mer Subunits

Set 1    Set 2    Set 3    Set 4    Set 5    Set 6
CATT     ACCC     AAAC     AAAG     AACA     AACG
CTAA     AGGG     ACCA     ACCA     ACAC     ACAA
TCAT     CACG     AGGG     AGGC     AGGG     AGGC
ACTA     CCGA     CACG     CACC     CAAG     CAAC
TACA     CGAC     CCGC     CCGG     CCGC     CCGG
TTTC     GAGC     CGAA     CGAA     CGCA     CGCA
ATCT     GCAG     GAGA     GAGA     GAGA     GAGA
AAAC     GGCA     GCAG     GCAC     GCCG     GCCC
         AAAA     GGCC     GGCG     GGAC     GGAG

Set 7    Set 8    Set 9    Set 10   Set 11   Set 12
AAGA     AAGC     AAGG     ACAG     ACCG     ACGA
ACAC     ACAA     ACAA     AACA     AAAA     AAAC
AGCG     AGCG     AGCC     AGGC     AGGC     AGCG
CAAG     CAAG     CAAC     CAAC     CACC     CACA
CCCA     CCCC     CCCG     CCGA     CCGA     CCAG
CGGC     CGGA     CGGA     CGCG     CGAG     CGGC
GACC     GACA     GACA     GAGG     GAGG     GAGG
GCGG     GCGG     GCGC     GCCC     GCAC     GCCC
GGAA     GGAC     GGAG     GGAA     GGCA     GGAA



The oligonucleotide tags of the invention and their complements are 
conveniently synthesized on an automated DNA synthesizer, e.g. an Applied 
Biosystems, Inc. (Foster City, California) model 392 or 394 DNA/RNA Synthesizer, 
using standard chemistries, such as phosphoramidite chemistry, e.g. disclosed in the 
following references: Beaucage and Iyer, Tetrahedron, 48: 2223-2311 (1992); Molko
et al, U.S. patent 4,9[remainder of line illegible in source]; Caruthers et al, U.S.
patents 4,415,732; 4,458,066; and 4,973,679; and the like. Alternative chemistries,
e.g. resulting in non-natural backbone groups, such as phosphorothioate,
phosphoramidate, and the like, may also be employed provided that the resulting
oligonucleotides are capable of specific hybridization. In some embodiments, tags
may comprise naturally occurring nucleotides that permit processing or manipulation 
by enzymes, while the corresponding tag complements may comprise non-natural 
nucleotide analogs, such as peptide nucleic acids, or like compounds, that promote the 
formation of more stable duplexes during sorting. 

When microparticles are used as supports, repertoires of oligonucleotide tags 
and tag complements may be generated by subunit-wise synthesis via "split and mix" 
techniques, e.g. as disclosed in Shortle et al International patent application 
PCT/US93/03418 or Lyttle et al, Biotechniques, 19: 274-280 (1995). Briefly, the 
basic unit of the synthesis is a subunit of the oligonucleotide tag. Preferably, 
phosphoramidite chemistry is used and 3' phosphoramidite oligonucleotides are 
prepared for each subunit in a minimally cross-hybridizing set, e.g. for the set first 
listed above, there would be eight 4-mer 3'-phosphoramidites. Synthesis proceeds as 
disclosed by Shortle et al or in direct analogy with the techniques employed to 
generate diverse oligonucleotide libraries using nucleosidic monomers, e.g. as 
disclosed in Telenius et al, Genomics, 13: 718-725 (1992); Welsh et al, Nucleic Acids
Research, 19: 5275-5279 (1991); Grothues et al, Nucleic Acids Research, 21: 1321-
1322 (1993); Hartley, European patent application 90304496.4; Lam et al, Nature,
354: 82-84 (1991); Zuckerman et al, Int. J. Pept. Protein Research, 40: 498-507
(1992); and the like. Generally, these techniques simply call for the application of
mixtures of the activated monomers to the growing oligonucleotide during the
coupling steps. Preferably, oligonucleotide tags and tag complements are synthesized
on a DNA synthesizer having a number of synthesis chambers which is greater than or
equal to the number of different kinds of words used in the construction of the tags.
That is, preferably there is a synthesis chamber corresponding to each type of word.
In this embodiment, words are added nucleotide-by-nucleotide, such that if a word
consists of five nucleotides there are five monomer couplings in each synthesis
chamber. After a word is completely synthesized, the synthesis supports are removed
from the chambers, mixed, and redistributed back to the chambers for the next cycle
of word addition. This latter embodiment takes advantage of the high coupling yields
of monomer addition, e.g. in phosphoramidite chemistries. 
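
The chamber-per-word "split and mix" cycle described above can be illustrated with a small simulation. This is a hedged sketch only: it models supports as list entries and ignores all chemistry, the function name `split_and_mix` and its parameters are hypothetical, and the word list is the Table II set with W6 assumed to be AGTA.

```python
import random

# Table II words (W6 assumed to be AGTA); one synthesis chamber per word.
WORDS = ["GATT", "TGAT", "TAGA", "TTTG", "GTAA", "AGTA", "ATGT", "AAAG"]


def split_and_mix(words, n_supports, n_cycles, seed=0):
    """Simulate word-wise split-and-mix synthesis on n_supports supports.

    In each cycle the supports are pooled and shuffled ("mix"), dealt out
    evenly among the chambers ("split"), and every support in a chamber is
    extended by that chamber's word.
    """
    rng = random.Random(seed)
    tags = [""] * n_supports
    for _ in range(n_cycles):
        order = list(range(n_supports))
        rng.shuffle(order)  # mix
        for position, support in enumerate(order):
            chamber = position % len(words)  # split
            tags[support] += words[chamber]
    return tags


if __name__ == "__main__":
    tags = split_and_mix(WORDS, n_supports=1000, n_cycles=9)
    print(tags[0], len(set(tags)))
```

Each support ends up carrying a single tag sequence, while the pool as a whole samples the repertoire of word combinations.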

Double stranded forms of tags may be made by separately synthesizing the 
complementary strands followed by mixing under conditions that permit duplex
formation. Alternatively, double stranded tags may be formed by first synthesizing a
single stranded repertoire linked to a known oligonucleotide sequence that serves as a
primer binding site. The second strand is then synthesized by combining the single
stranded repertoire with a primer and extending with a polymerase. This latter
approach is described in Oliphant et al, Gene, 44: 177-183 (1986). Such duplex tags
may then be inserted into cloning vectors along with target polynucleotides for sorting
and manipulation of the target polynucleotide in accordance with the invention.

When tag complements are employed that are made up of nucleotides that 
have enhanced binding characteristics, such as PNAs or oligonucleotide N3'->P5'
phosphoramidates, sorting can be implemented through the formation of D-loops
between tags comprising natural nucleotides and their PNA or phosphoramidate
complements, as an alternative to the "stripping" reaction employing the 3'->5'
exonuclease activity of a DNA polymerase to render a tag single stranded. 

Oligonucleotide tags of the invention may range in length from 12 to 60 
nucleotides or basepairs. Preferably, oligonucleotide tags range in length from 18 to
40 nucleotides or basepairs. More preferably, oligonucleotide tags range in length
from 25 to 40 nucleotides or basepairs. In terms of preferred and more preferred
numbers of subunits, these ranges may be expressed as follows: 

Table IV

Numbers of Subunits in Tags in Preferred Embodiments

Monomers       Nucleotides in Oligonucleotide Tag
in Subunit     (12-60)           (18-40)           (25-40)

3              4-20 subunits     6-13 subunits     8-13 subunits
4              3-15 subunits     4-10 subunits     6-10 subunits
5              2-12 subunits     3-8 subunits      5-8 subunits
6              2-10 subunits     3-6 subunits      4-6 subunits

Most preferably, oligonucleotide tags are single stranded and specific hybridization 
occurs via Watson-Crick pairing with a tag complement. 
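
The subunit counts in Table IV appear to follow from dividing each nucleotide bound by the subunit length and discarding the remainder. The sketch below works under that assumption; the rounding rule is inferred from the tabulated values rather than stated in the text, and the function name `subunit_range` is illustrative.

```python
def subunit_range(min_nt, max_nt, subunit_len):
    """Return (min, max) subunits per tag for a nucleotide length range.

    Both bounds use floor division, which is the rule that reproduces
    the entries of Table IV (an inference, not a stated definition).
    """
    return min_nt // subunit_len, max_nt // subunit_len


if __name__ == "__main__":
    for subunit_len in (3, 4, 5, 6):
        row = [subunit_range(lo, hi, subunit_len)
               for lo, hi in ((12, 60), (18, 40), (25, 40))]
        print(subunit_len, row)
```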

Preferably, repertoires of single stranded oligonucleotide tags of the invention 
contain at least 100 members; more preferably, repertoires of such tags contain at
least 1000 members; and most preferably, repertoires of such tags contain at least
10,000 members. 

Triplex Tags 

In embodiments where specific hybridization occurs via triplex formation,
coding of tag sequences follows the same principles as for duplex-forming tags;
however, there are further constraints on the selection of subunit sequences.
Generally, third strand association via Hoogsteen type of binding is most stable along
homopyrimidine-homopurine tracks in a double stranded target. Usually, base triplets
form in T-A*T or C-G*C motifs (where "-" indicates Watson-Crick pairing and "*"
indicates Hoogsteen type of binding); however, other motifs are also possible. For
example, Hoogsteen base pairing permits parallel and antiparallel orientations
between the third strand (the Hoogsteen strand) and the purine-rich strand of the
duplex to which the third strand binds, depending on conditions and the composition
of the strands. There is extensive guidance in the literature for selecting appropriate
sequences, orientation, conditions, nucleoside type (e.g. whether ribose or
deoxyribose nucleosides are employed), base modifications (e.g. methylated cytosine,
and the like) in order to maximize, or otherwise regulate, triplex stability as desired in
particular embodiments, e.g. Roberts et al, Proc. Natl. Acad. Sci., 88: 9397-9401
(1991); Roberts et al, Science, 258: 1463-1466 (1992); Roberts et al, Proc. Natl.
Acad. Sci., 93: 4320-4325 (1996); Distefano et al, Proc. Natl. Acad. Sci., 90: 1179-
1183 (1993); Mergny et al, Biochemistry, 30: 9791-9798 (1991); Cheng et al, J. Am.
Chem. Soc., 114: 4465-4474 (1992); Beal and Dervan, Nucleic Acids Research, 20:
2773-2776 (1992); Beal and Dervan, J. Am. Chem. Soc., 114: 4976-4982 (1992);
Giovannangeli et al, Proc. Natl. Acad. Sci., 89: 8631-8635 (1992); Moser and Dervan,
Science, 238: 645-650 (1987); McShan et al, J. Biol. Chem., 267: 5712-5721 (1992);
Yoon et al, Proc. Natl. Acad. Sci., 89: 3840-3844 (1992); Blume et al, Nucleic Acids
Research, 20: 1777-1784 (1992); Thuong and Helene, Angew. Chem. Int. Ed. Engl.
32: 666-690 (1993); Escude et al, Proc. Natl. Acad. Sci., 93: 4365-4369 (1996); and
the like. Conditions for annealing single-stranded or duplex tags to their single-
stranded or duplex complements are well known, e.g. Ji et al, Anal. Chem. 65: 1323-
1328 (1993); Cantor et al, U.S. patent 5,482,836; and the like. Use of triplex tags has
the advantage of not requiring a "stripping" reaction with polymerase to expose the
tag for annealing to its complement.

Preferably, oligonucleotide tags of the invention employing triplex 
hybridization are double stranded DNA and the corresponding tag complements are
single stranded. More preferably, 5-methylcytosine is used in place of cytosine in the
tag complements in order to broaden the range of pH stability of the triplex formed
between a tag and its complement. Preferred conditions for forming triplexes are
fully disclosed in the above references. Briefly, hybridization takes place in
concentrated salt solution, e.g. 1.0 M NaCl, 1.0 M potassium acetate, or the like, at
pH below 5.5 (or 6.5 if 5-methylcytosine is employed). Hybridization temperature
depends on the length and composition of the tag; however, for an 18-20-mer tag or
longer, hybridization at room temperature is adequate. Washes may be conducted
with less concentrated salt solutions, e.g. 10 mM sodium acetate, 100 mM MgCl2, pH
5.8, at room temperature. Tags may be eluted from their tag complements by
incubation in a similar salt solution at pH 9.0.

Minimally cross-hybridizing sets of oligonucleotide tags that form triplexes
may be generated by the computer program of Appendix Ic, or similar programs. An
exemplary set of double stranded 8-mer words is listed below in capital letters with
the corresponding complements in small letters. Each such word differs from each of 
the other words in the set by three base pairs. 

Table V

Exemplary Minimally Cross-Hybridizing
Set of Double Stranded 8-mer Tags

5'-AAGGAGAG    5'-AAAGGGGA    5'-AGAGAAGA    5'-AGGGGGGG
3'-TTCCTCTC    3'-TTTCCCCT    3'-TCTCTTCT    3'-TCCCCCCC
3'-ttcctctc    3'-tttcccct    3'-tctcttct    3'-tccccccc

5'-AAAAAAAA    5'-AAGAGAGA    5'-AGGAAAAG    5'-GAAAGGAG
3'-TTTTTTTT    3'-TTCTCTCT    3'-TCCTTTTC    3'-CTTTCCTC
3'-tttttttt    3'-ttctctct    3'-tccttttc    3'-ctttcctc

5'-AAAAAGGG    5'-AGAAGAGG    5'-AGGAAGGA    5'-GAAGAAGG
3'-TTTTTCCC    3'-TCTTCTCC    3'-TCCTTCCT    3'-CTTCTTCC
3'-tttttccc    3'-tcttctcc    3'-tccttcct    3'-cttcttcc

5'-AAAGGAAG    5'-AGAAGGAA    5'-AGGGGAAA    5'-GAAGAGAA
3'-TTTCCTTC    3'-TCTTCCTT    3'-TCCCCTTT    3'-CTTCTCTT
3'-tttccttc    3'-tcttcctt    3'-tccccttt    3'-cttctctt





Table VI

Repertoire Size of Various Double Stranded Tags
That Form Triplexes with Their Tag Complements

                   Nucleotide
                   Difference
                   between             Maximal Size
                   Oligonucleotides    of Minimally     Size of        Size of
Oligonucleotide    of Minimally        Cross-           Repertoire     Repertoire
Word               Cross-              Hybridizing      with Four      with Five
Length             Hybridizing Set     Set              Words          Words

4                  2                   8                4096           3.2 x 10^4
6                  3                   8                4096           3.2 x 10^4
8                  3                   16               6.5 x 10^4     1.05 x 10^6
10                 5                   8                4096
15                 5                   92
20                 6                   765
20                 8                   92
20                 10
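
The repertoire sizes in Table VI are consistent with raising the size of the minimally cross-hybridizing set to the number of words per tag. The following sketch illustrates that arithmetic only; the function name `repertoire_size` is illustrative and not taken from the disclosure.

```python
def repertoire_size(set_size, words_per_tag):
    """Number of distinct tags built from words_per_tag independent words,
    each drawn from a set of set_size words."""
    return set_size ** words_per_tag


if __name__ == "__main__":
    for set_size in (8, 16):
        print(set_size,
              repertoire_size(set_size, 4),
              repertoire_size(set_size, 5))
```

For a set of 8 words this gives 4096 and 32,768 (about 3.2 x 10^4) tags for four and five words; for a set of 16 words it gives 65,536 (about 6.5 x 10^4) and 1,048,576 (about 1.05 x 10^6).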









Preferably, repertoires of double stranded oligonucleotide tags of the invention
contain at least 10 members; more preferably, repertoires of such tags contain at least
100 members. Preferably, words are between 4 and 8 nucleotides in length for
combinatorially synthesized double stranded oligonucleotide tags, and oligonucleotide
tags are between 12 and 60 base pairs in length. More preferably, such tags are
between 18 and 40 base pairs in length.

Solid Phase Supports 
Solid phase supports for use with the invention may have a wide variety of 
forms, including microparticles, beads, and membranes, slides, plates, micromachined 
chips, and the like. Likewise, solid phase supports of the invention may comprise a
wide variety of compositions, including glass, plastic, silicon, alkanethiolate-
derivatized gold, cellulose, low cross-linked and high cross-linked polystyrene, silica 
gel, polyamide, and the like. Preferably, either a population of discrete particles is
employed such that each has a uniform coating, or population, of complementary
sequences of the same tag (and no other), or a single or a few supports are employed
with spatially discrete regions each containing a uniform coating, or population, of
complementary sequences to the same tag (and no other). In the latter embodiment,
the area of the regions may vary according to particular applications; usually, the
regions range in area from several µm^2, e.g. 3-5, to several hundred µm^2, e.g. 100-
500. Preferably, such regions are spatially discrete so that signals generated by
events, e.g. fluorescent emissions, at adjacent regions can be resolved by the detection
system being employed. In some applications, it may be desirable to have regions
with uniform coatings of more than one tag complement, e.g. for simultaneous
sequence analysis, or for bringing separately tagged molecules into close proximity.
Tag complements may be used with the solid phase support that they are
synthesized on, or they may be separately synthesized and attached to a solid phase
support for use, e.g. as disclosed by Lund et al, Nucleic Acids Research, 16: 10861-
10880 (1988); Albretsen et al, Anal. Biochem., 189: 40-50 (1990); Wolf et al, Nucleic
Acids Research, 15: 2911-2926 (1987); or Ghosh et al, Nucleic Acids Research, 15:
5353-5372 (1987). Preferably, tag complements are synthesized on and used with the
same solid phase support, which may comprise a variety of forms and include a
variety of linking moieties. Such supports may comprise microparticles or arrays, or
matrices, of regions where uniform populations of tag complements are synthesized.
A wide variety of microparticle supports may be used with the invention, including
microparticles made of controlled pore glass (CPG), highly cross-linked polystyrene,
acrylic copolymers, cellulose, nylon, dextran, latex, polyacrolein, and the like,
disclosed in the following exemplary references: Meth. Enzymol., Section A, pages
11-147, vol. 44 (Academic Press, New York, 1976); U.S. patents 4,678,814;
4,413,070; and 4,046,720; and Pon, Chapter 19, in Agrawal, editor, Methods in
Molecular Biology, Vol. 20, (Humana Press, Totowa, NJ, 1993). Microparticle
supports further include commercially available nucleoside-derivatized CPG and
polystyrene beads (e.g. available from Applied Biosystems, Foster City, CA);
derivatized magnetic beads; polystyrene grafted with polyethylene glycol (e.g.,
TentaGel(TM), Rapp Polymere, Tubingen, Germany); and the like. Selection of the
support characteristics, such as material, porosity, size, shape, and the like, and the 
type of linking moiety employed depends on the conditions under which the tags are 
used. For example, in applications involving successive processing with enzymes, 
supports and linkers that minimize steric hindrance of the enzymes and that facilitate
access to substrate are preferred. Other important factors to be considered in selecting
the most appropriate microparticle support include size uniformity, efficiency as a
synthesis support, degree to which surface area is known, and optical properties, e.g. as
explained more fully below, clear smooth beads provide instrumentational advantages
when handling large numbers of beads on a surface.

Exemplary linking moieties for attaching and/or synthesizing tags on 
microparticle surfaces are disclosed in Pon et al, Biotechniques, 6:768-775 (1988); 
Webb, U.S. patent 4,659,774; Barany et al, International patent application
PCT/US91/06103; Brown et al, J. Chem. Soc. Commun., 1989: 891-893; Damha et
al, Nucleic Acids Research, 18: 3813-3821 (1990); Beattie et al, Clinical Chemistry,
39: 719-722 (1993); Maskos and Southern, Nucleic Acids Research, 20: 1679-1684 
(1992); and the like. 

As mentioned above, tag complements may also be synthesized on a single 
(or a few) solid phase support to form an array of regions uniformly coated with tag 
complements. That is, within each region in such an array the same tag complement
is synthesized. Techniques for synthesizing such arrays are disclosed in McGall et al,
International application PCT/US93/03767; Pease et al, Proc. Natl. Acad. Sci., 91:
5022-5026 (1994); Southern and Maskos, International application
PCT/GB89/01114; Maskos and Southern (cited above); Southern et al, Genomics, 13:
1008-1017 (1992); and Maskos and Southern, Nucleic Acids Research, 21: 4663-
4669 (1993).

Preferably, the invention is implemented with microparticles or beads 
uniformly coated with complements of the same tag sequence. Microparticle supports 
and methods of covalently or noncovalently linking oligonucleotides to their surfaces
are well known, as exemplified by the following references: Beaucage and Iyer (cited
above); Gait, editor, Oligonucleotide Synthesis: A Practical Approach (IRL Press,
Oxford, 1984); and the references cited above. Generally, the size and shape of a
microparticle is not critical; however, microparticles in the size range of a few, e.g. 1-
2, to several hundred, e.g. 200-1000 µm diameter are preferable, as they facilitate the
construction and manipulation of large repertoires of oligonucleotide tags with
minimal reagent and sample usage. 

In some preferred applications, commercially available controlled-pore glass
(CPG) or polystyrene supports are employed as solid phase supports in the invention.
Such supports come available with base-labile linkers and initial nucleosides attached,
e.g. Applied Biosystems (Foster City, CA). Preferably, microparticles having pore
size between 500 and 1000 angstroms are employed.

In other preferred applications, non-porous microparticles are employed for 
their optical properties, which may be advantageously used when tracking large
numbers of microparticles on planar supports, such as a microscope slide.
Particularly preferred non-porous microparticles are the glycidyl methacrylate (GMA)
beads available from Bangs Laboratories (Carmel, IN). Such microparticles are
useful in a variety of sizes and derivatized with a variety of linkage groups for
synthesizing tags or tag complements. Preferably, for massively parallel
manipulations of tagged microparticles, 5 µm diameter GMA beads are employed.




Attaching Tags to Polynucleotides 
For Sorting onto Solid Phase Supports 
An important aspect of the invention is the sorting and attachment of a
population of polynucleotides, e.g. from a cDNA library, to microparticles or to
separate regions on a solid phase support such that each microparticle or region has
substantially only one kind of polynucleotide attached. This objective is
accomplished by ensuring that substantially all different polynucleotides have
different tags attached. This condition, in turn, is brought about by taking a sample of
the full ensemble of tag-polynucleotide conjugates for analysis. (It is acceptable that
identical polynucleotides have different tags, as it merely results in the same
polynucleotide being operated on or analyzed twice in two different locations.) Such
sampling can be carried out either overtly (for example, by taking a small volume
from a larger mixture) after the tags have been attached to the polynucleotides, it can
be carried out inherently as a secondary effect of the techniques used to process the
polynucleotides and tags, or sampling can be carried out both overtly and as an
inherent part of processing steps.

Preferably, in constructing a cDNA library where substantially all different
cDNAs have different tags, a tag repertoire is employed whose complexity, or number
of distinct tags, greatly exceeds the total number of mRNAs extracted from a cell or
tissue sample. Preferably, the complexity of the tag repertoire is at least 10 times that
of the polynucleotide population; and more preferably, the complexity of the tag
repertoire is at least 100 times that of the polynucleotide population. Below, a
protocol is disclosed for cDNA library construction using a primer mixture that
contains a full repertoire of exemplary 9-word tags. Such a mixture of tag-containing
primers has a complexity of 8^9, or about 1.34 x 10^8. As indicated by Winslow et al,
Nucleic Acids Research, 19: 3251-3253 (1991), mRNA for library construction can
be extracted from as few as 10-100 mammalian cells. Since a single mammalian cell
contains about 5 x 10^5 copies of mRNA molecules of about 3.4 x 10^4 different kinds,
by standard techniques one can isolate the mRNA from about 100 cells, or
(theoretically) about 5 x 10^7 mRNA molecules. Comparing this number to the
complexity of the primer mixture shows that without any additional steps, and even
assuming that mRNAs are converted into cDNAs with perfect efficiency (1%
efficiency or less is more accurate), the cDNA library construction protocol results in
a population containing no more than 37% of the total number of different tags. That
is, without any overt sampling step at all, the protocol inherently generates a sample
that comprises 37%, or less, of the tag repertoire. The probability of obtaining a
double under these conditions is about 5%, which is within the preferred range. With
mRNA from 10 cells, the fraction of the tag repertoire sampled is reduced to only
3.7%, even assuming that all the processing steps take place at 100% efficiency. In
fact, the efficiencies of the processing steps for constructing cDNA libraries are very
low, a "rule of thumb" being that a good library should contain about 10^ cDNA clones
from mRNA extracted from 10^ mammalian cells.
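
The 37% and 3.7% figures above follow from dividing the number of mRNA molecules by the complexity of the tag repertoire. The sketch below reproduces that arithmetic under the stated assumptions (8 words per set, 9 words per tag, about 5 x 10^5 mRNA molecules per cell, and an adjustable conversion efficiency); the constant and function names are illustrative.

```python
MRNA_PER_CELL = 5e5  # approximate mRNA copies per mammalian cell (from the text)


def tag_complexity(set_size=8, words_per_tag=9):
    """Number of distinct tags in the full repertoire."""
    return set_size ** words_per_tag


def sampled_fraction(n_cells, efficiency=1.0):
    """Upper bound on the fraction of the tag repertoire that ends up
    attached to cDNAs, for mRNA taken from n_cells cells."""
    n_molecules = n_cells * MRNA_PER_CELL * efficiency
    return n_molecules / tag_complexity()


if __name__ == "__main__":
    print(tag_complexity())
    print(round(sampled_fraction(100), 3), round(sampled_fraction(10), 4))
```

Lowering `efficiency` toward the 1% figure mentioned in the text reduces the sampled fraction proportionally.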

With use of larger amounts of mRNA in the above protocol, or for larger amounts
of polynucleotides in general, where the number of such molecules exceeds the
complexity of the tag repertoire, a tag-polynucleotide conjugate mixture potentially
contains every possible pairing of tags and types of mRNA or polynucleotide. In such
cases, overt sampling may be implemented by removing a sample volume after a
serial dilution of the starting mixture of tag-polynucleotide conjugates. The amount
of dilution required depends on the amount of starting material and the efficiencies of 
the processing steps, which are readily estimated. 

If mRNA were extracted from 10^6 cells (which would correspond to about 0.5
µg of poly(A)+ RNA), and if primers were present in about 10-100 fold concentration
excess, as is called for in a typical protocol, e.g. Sambrook et al, Molecular Cloning,
Second Edition, page 8.61 [10 µL 1.8 kb mRNA at 1 mg/mL equals about 1.68 x 10^-11
moles and 10 µL 18-mer primer at 1 mg/mL equals about 1.68 x 10^-9 moles], then the
total number of tag-polynucleotide conjugates in a cDNA library would simply be
equal to or less than the starting number of mRNAs, or about 5 x 10^11 vectors
containing tag-polynucleotide conjugates. Again, this assumes that each step in cDNA
construction (first strand synthesis, second strand synthesis, ligation into a vector)
occurs with perfect efficiency, which is a very conservative estimate. The actual
number is significantly less.
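
The molar quantities quoted in brackets above can be reproduced with a short calculation. This sketch assumes an average mass of 330 g/mol per nucleotide, a common approximation that is not stated in the text but is consistent with the quoted values; the names `AVG_NT_MASS` and `moles` are illustrative.

```python
AVG_NT_MASS = 330.0  # g/mol per nucleotide; assumed average, not from the text


def moles(volume_ul, conc_mg_per_ml, length_nt):
    """Moles of a single-stranded nucleic acid of length_nt nucleotides
    in volume_ul microliters at conc_mg_per_ml mg/mL."""
    mass_g = volume_ul * conc_mg_per_ml * 1e-6  # uL x mg/mL = ug; ug -> g
    return mass_g / (length_nt * AVG_NT_MASS)


if __name__ == "__main__":
    mrna = moles(10, 1.0, 1800)   # 10 uL of 1.8 kb mRNA at 1 mg/mL
    primer = moles(10, 1.0, 18)   # 10 uL of 18-mer primer at 1 mg/mL
    print(mrna, primer, primer / mrna)
```

Under these assumptions the primer is in about 100-fold molar excess over the mRNA, at the upper end of the 10-100 fold range mentioned.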

If a sample of n tag-polynucleotide conjugates is randomly drawn from a
reaction mixture, as could be effected by taking a sample volume, the probability of
drawing conjugates having the same tag is described by the Poisson distribution,
P(r) = e^(-λ) λ^r / r!, where r is the number of conjugates having the same tag and λ = np,
where p is the probability of a given tag being selected. If n = 10^6 and p = 1/(1.34 x
10^8), then λ = 0.00746 and P(2) = 2.76 x 10^-5. Thus, a sample of one million molecules
gives rise to an expected number of doubles well within the preferred range. Such a
sample is readily obtained as follows: Assume that the 5 x 10^11 mRNAs are perfectly
converted into 5 x 10^11 vectors with tag-cDNA conjugates as inserts and that the 5 x
10^11 vectors are in a reaction solution having a volume of 100 µl. Four 10-fold serial
dilutions may be carried out by transferring 10 µl from the original solution into a
vessel containing 90 µl of an appropriate buffer, such as TE. This process may be
repeated for three additional dilutions to obtain a 100 µl solution containing 5 x 10^5
vector molecules per µl. A 2 µl aliquot from this solution yields 10^6 vectors
containing tag-cDNA conjugates as inserts. This sample is then amplified by
straightforward transformation of a competent host cell followed by culturing.
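
The Poisson figures and the dilution scheme above can be checked numerically. The following is a minimal sketch using the values given in the text (n = 10^6, repertoire complexity 1.34 x 10^8, 5 x 10^11 vectors in 100 µl); the function names `poisson` and `per_ul_after_dilutions` are illustrative.

```python
import math


def poisson(r, lam):
    """Poisson probability of exactly r events with mean lam."""
    return math.exp(-lam) * lam ** r / math.factorial(r)


def per_ul_after_dilutions(total_molecules, volume_ul, n_tenfold):
    """Concentration (molecules per uL) after n_tenfold serial 10-fold
    dilutions of total_molecules initially contained in volume_ul."""
    return total_molecules / volume_ul / 10 ** n_tenfold


if __name__ == "__main__":
    n = 1e6                # conjugates sampled
    p = 1 / 1.34e8         # chance that a given tag is selected
    lam = n * p
    print(lam, poisson(2, lam))

    conc = per_ul_after_dilutions(5e11, 100, 4)
    print(conc, 2 * conc)  # molecules per uL, and in a 2 uL aliquot
```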

Of course, as mentioned above, no step in the above process proceeds with 
perfect efficiency. In particular, when vectors are employed to amplify a sample of
tag-polynucleotide conjugates, the step of transforming a host is very inefficient.
Usually, no more than 1% of the vectors are taken up by the host and replicated.
Thus, for such a method of amplification, even fewer dilutions would be required to
obtain a sample of 10^6 conjugates.

A repertoire of oligonucleotide tags can be conjugated to a population of 
polynucleotides in a number of ways, including direct enzymatic ligation,
amplification, e.g. via PCR, using primers containing the tag sequences, and the like.
The initial ligating step produces a very large population of tag-polynucleotide
conjugates such that a single tag is generally attached to many different
polynucleotides. However, as noted above, by taking a sufficiently small sample of
the conjugates, the probability of obtaining "doubles," i.e. the same tag on two
different polynucleotides, can be made negligible. Generally, the larger the sample
the greater the probability of obtaining a double. Thus, a design trade-off exists
between selecting a large sample of tag-polynucleotide conjugates (which, for
example, ensures adequate coverage of a target polynucleotide in a shotgun
sequencing operation or adequate representation of a rapidly changing mRNA pool)
and selecting a small sample which ensures that a minimal number of doubles will be
present. In most embodiments, the presence of doubles merely adds an additional 
source of noise or, in the case of sequencing, a minor complication in scanning and 
signal processing, as microparticles giving multiple fluorescent signals can simply be 
ignored. 

As used herein, the term "substantially all" in reference to attaching tags to
molecules, especially polynucleotides, is meant to reflect the statistical nature of the
sampling procedure employed to obtain a population of tag-molecule conjugates
essentially free of doubles. The meaning of substantially all in terms of actual
percentages of tag-molecule conjugates depends on how the tags are being employed.
Preferably, for nucleic acid sequencing, substantially all means that at least eighty
percent of the polynucleotides have unique tags attached. More preferably, it means
that at least ninety percent of the polynucleotides have unique tags attached. Still
more preferably, it means that at least ninety-five percent of the polynucleotides have
unique tags attached. And, most preferably, it means that at least ninety-nine percent
of the polynucleotides have unique tags attached.

Preferably, when the population of polynucleotides consists of messenger 
RNA (mRNA), oligonucleotide tags may be attached by reverse transcribing the
mRNA with a set of primers preferably containing complements of tag sequences.
An exemplary set of such primers could have the following sequence (SEQ ID NO:
1):

5'-mRNA-[A]n-3'
        [T]19GG[W,W,W,C]9AC CAGCTG ATC-5'-biotin

where "[W,W,W,C]9" represents the sequence of an oligonucleotide tag of nine
subunits of four nucleotides each and "[W,W,W,C]" represents the subunit sequences
listed above, i.e. "W" represents T or A. The underlined sequences identify an
optional restriction endonuclease site that can be used to release the polynucleotide
from attachment to a solid phase support via the biotin, if one is employed. For the
above primer, the complement attached to a microparticle could have the form:

5'-[G,W,W,W]9TGG-linker-microparticle
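
The "[W,W,W,C]" and "[G,W,W,W]" shorthand can be related to the words of Table II by taking reverse complements. The sketch below is illustrative only: it assumes the Table II words (with W6 taken as AGTA), each of which contains exactly one G and three A or T residues, so that each complement contains exactly one C and three W residues. The names `reverse_complement` and `TABLE_II` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Table II words (W6 assumed to be AGTA); each has one G and three A/T.
TABLE_II = ["GATT", "TGAT", "TAGA", "TTTG", "GTAA", "AGTA", "ATGT", "AAAG"]


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    for word in TABLE_II:
        print(word, reverse_complement(word))
```

Each printed complement is made up of one C and three W (A or T) residues, which is what the bracketed primer notation summarizes.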

After reverse transcription, the mRNA is removed, e.g. by RNase H digestion, 
and the second strand of the cDNA is synthesized using, for example, a primer of the 
following form (SEQ ID NO: 2):

5'-NRRGATCYNNN-3'

where N is any one of A, T, G, or C; R is a purine-containing nucleotide, and Y is a
pyrimidine-containing nucleotide. This particular primer creates a Bst YI restriction
site in the resulting double stranded DNA which, together with the Sal I site,
facilitates cloning into a vector with, for example, Bam HI and Xho I sites. After Bst
YI and Sal I digestion, the exemplary conjugate would have the form:






5'-TCGACCA[C,W,W,W]9GG[T]19-cDNA-NNNR
       GGT[G,W,W,W]9CC[A]19-cDNA-NNNYCTAG-5'

The polynucleotide-tag conjugates may then be manipulated using standard molecular
biology techniques. For example, the above conjugate, which is actually a mixture,
may be inserted into commercially available cloning vectors, e.g. Stratagene Cloning
System (La Jolla, CA); transfected into a host, such as a commercially available host
bacteria; which is then cultured to increase the number of conjugates. The cloning
vectors may then be isolated using standard techniques, e.g. Sambrook et al,
Molecular Cloning, Second Edition (Cold Spring Harbor Laboratory, New York,
1989). Alternatively, appropriate adaptors and primers may be employed so that the

Preferably, when the ligase-based method of sequencing is employed, the Bst
YI and Sal I digested fragments are cloned into a Bam HI-/Xho I-digested vector
having the following single-copy restriction sites (SEQ ID NO: 3):

5'-GA GGATG CCTTTAT GGATCC A CTCGAG ATCCCAATCCA-3'
      Fok I         Bam HI   Xho I

This adds the Fok I site which will allow initiation of the sequencing process 
discussed more fully below. 
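
The single-copy claim for the three sites can be checked directly against the listed sequence. The following is a minimal sketch, not part of the disclosure: the helper names are hypothetical, and the recognition sequences used (GGATG for Fok I, GGATCC for Bam HI, CTCGAG for Xho I) are the standard ones for those enzymes. Because the Fok I site is not palindromic, the reverse strand is searched as well.

```python
# SEQ ID NO: 3 from the text, written 5'->3' with the spacing removed.
POLYLINKER = "GAGGATGCCTTTATGGATCCACTCGAGATCCCAATCCA"

# Standard recognition sequences for the enzymes named under the sequence.
SITES = {"Fok I": "GGATG", "Bam HI": "GGATCC", "Xho I": "CTCGAG"}

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_all(seq, site):
    """Return every 0-based start position of site in seq (overlaps allowed)."""
    width = len(site)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == site]


def site_map(seq):
    """Map each enzyme to its positions on the top strand, counting both orientations."""
    result = {}
    for name, site in SITES.items():
        positions = set(find_all(seq, site))
        # A non-palindromic site may also appear in the opposite orientation.
        positions |= set(find_all(seq, reverse_complement(site)))
        result[name] = sorted(positions)
    return result


if __name__ == "__main__":
    for enzyme, positions in site_map(POLYLINKER).items():
        print(enzyme, positions)
```

Each enzyme should map to exactly one position, in the order Fok I, Bam HI, Xho I along the sequence.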

Tags can be conjugated to cDNAs of existing libraries by standard cloning
methods. cDNAs are excised from their existing vector, isolated, and then ligated into
a vector containing a repertoire of tags. Preferably, the tag-containing vector is
linearized by cleaving with two restriction enzymes so that the excised cDNAs can be
ligated in a predetermined orientation. The concentration of the linearized tag-
containing vector is in substantial excess over that of the cDNA inserts so that
ligation provides an inherent sampling of tags.

A general method for exposing the single stranded tag after amplification
involves digesting a target polynucleotide-containing conjugate with the 3'→5'
exonuclease activity of T4 DNA polymerase, or a like enzyme. When used in the
presence of a single deoxynucleoside triphosphate, such a polymerase will cleave
nucleotides from 3' recessed ends present on the non-template strand of a double
stranded fragment until a complement of the single deoxynucleoside triphosphate is
reached on the template strand. When such a nucleotide is reached the 3'→5'
digestion effectively ceases, as the polymerase's extension activity adds nucleotides at
a higher rate than the excision activity removes nucleotides. Consequently, single
stranded tags constructed with three nucleotides are readily prepared for loading onto
solid phase supports.
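
The stopping rule described above can be pictured with a small, idealized model. This is only a sketch under stated assumptions: the function name and the example strand are hypothetical, and the model treats the balance between excision and re-extension as an absolute stop at the first position that the supplied triphosphate can refill.

```python
def chew_back(strand_5to3, dntp):
    """Idealized exonuclease digestion in the presence of one dNTP.

    Nucleotides are removed one at a time from the 3' end of the strand.
    Digestion stops at the first 3'-terminal nucleotide that matches the
    supplied dNTP, because the polymerase would immediately replace it.
    Returns the part of the strand that survives.
    """
    end = len(strand_5to3)
    while end > 0 and strand_5to3[end - 1] != dntp:
        end -= 1
    return strand_5to3[:end]


if __name__ == "__main__":
    # Hypothetical strand: a short anchor ending in G, followed by a
    # stretch built from only three nucleotides (A, C, T).
    strand = "ACCG" + "TTCATACTA"
    print(chew_back(strand, "G"))  # only the anchor survives
```

A stretch built from three nucleotides contains no stopping position for the fourth, so supplying only that fourth triphosphate lets digestion run through the whole stretch and halt at the flanking sequence.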

The technique may also be used to preferentially methylate interior Fok I sites
of a target polynucleotide while leaving a single Fok I site at the terminus of the
polynucleotide unmethylated. First, the terminal Fok I site is rendered single stranded
using a polymerase with deoxycytidine triphosphate. The double stranded portion of
the fragment is then methylated, after which the single stranded terminus is filled in
with a DNA polymerase in the presence of all four nucleoside triphosphates, thereby
regenerating the Fok I site. Clearly, this procedure can be generalized to
endonucleases other than Fok I.

After the oligonucleotide tags are prepared for specific hybridization, e.g. by
rendering them single stranded as described above, the polynucleotides are mixed
with microparticles containing the complementary sequences of the tags under
conditions that favor the formation of perfectly matched duplexes between the tags
and their complements. There is extensive guidance in the literature for creating these
conditions. Exemplary references providing such guidance include Wetmur, Critical
Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); Sambrook et
al, Molecular Cloning: A Laboratory Manual, 2nd Edition (Cold Spring Harbor
Laboratory, New York, 1989); and the like. Preferably, the hybridization conditions
are sufficiently stringent so that only perfectly matched sequences form stable
duplexes. Under such conditions the polynucleotides specifically hybridized through
their tags may be ligated to the complementary sequences attached to the
microparticles. Finally, the microparticles are washed to remove polynucleotides with
unligated and/or mismatched tags.

When CPG microparticles conventionally employed as synthesis supports are
used, the density of tag complements on the microparticle surface is typically greater
than that necessary for some sequencing operations. That is, in sequencing
approaches that require successive treatment of the attached polynucleotides with a
variety of enzymes, densely spaced polynucleotides may tend to inhibit access of the
relatively bulky enzymes to the polynucleotides. In such cases, the polynucleotides
are preferably mixed with the microparticles so that tag complements are present in
significant excess, e.g. from 10:1 to 100:1, or greater, over the polynucleotides. This
ensures that the density of polynucleotides on the microparticle surface will not be so
high as to inhibit enzyme access. Preferably, the average inter-polynucleotide spacing
on the microparticle surface is on the order of 30-100 nm. Guidance in selecting
ratios for standard CPG supports and Ballotini beads (a type of solid glass support) is
found in Maskos and Southern, Nucleic Acids Research, 20: 1679-1684 (1992).
Preferably, for sequencing applications, standard CPG beads of diameter in the range
of 20-50 µm are loaded with about 10^ polynucleotides, and GMA beads of diameter
in the range of 5-10 µm are loaded with a few tens of thousand of polynucleotides,
e.g. 4 x 10^4 to 6 x 10^4.

In the preferred embodiment, tag complements are synthesized on
microparticles combinatorially; thus, at the end of the synthesis, one obtains a
complex mixture of microparticles from which a sample is taken for loading tagged
polynucleotides. The size of the sample of microparticles will depend on several
factors, including the size of the repertoire of tag complements, the nature of the
apparatus used for observing loaded microparticles (e.g. its capacity), the tolerance
for multiple copies of microparticles with the same tag complement (i.e. "bead
doubles"), and the like. The following table provides guidance regarding
microparticle sample size, microparticle diameter, and the approximate physical
dimensions of a packed array of microparticles of various diameters.



Microparticle diameter        5 µm           10 µm        20 µm          40 µm

Max. no. polynucleotides
loaded at 1 per 10^5 sq.
angstrom                                     3 x 10^5     1.26 x 10^6    5 x 10^6

Approx. area of monolayer
of 10^6 microparticles        .45 x .45 cm   1 x 1 cm     2 x 2 cm       4 x 4 cm
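
The entries in this table follow from simple geometry. The sketch below is illustrative only: it assumes a smooth spherical microparticle, one polynucleotide per 10^5 square angstroms of surface, and square packing of 10^6 microparticles in the monolayer. The function names are hypothetical, and the packing assumption is a simplification, so the smallest diameter comes out slightly different from the tabulated .45 cm.

```python
import math

ANGSTROM_PER_MICRON = 1e4
CM_PER_MICRON = 1e-4


def max_loading(diameter_um, area_per_molecule_sq_angstrom=1e5):
    """Surface area of a sphere divided by the area allotted to each molecule."""
    radius_angstrom = diameter_um / 2 * ANGSTROM_PER_MICRON
    surface = 4 * math.pi * radius_angstrom ** 2
    return surface / area_per_molecule_sq_angstrom


def monolayer_side_cm(diameter_um, n_beads=1e6):
    """Side of a square holding n_beads, each occupying a d x d cell."""
    beads_per_side = math.sqrt(n_beads)
    return beads_per_side * diameter_um * CM_PER_MICRON


if __name__ == "__main__":
    for d in (5, 10, 20, 40):
        print(f"{d:>2} um: {max_loading(d):.3g} molecules, "
              f"{monolayer_side_cm(d):.2f} cm side")
```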



The probability that the sample of microparticles contains a given tag complement, or
that a given tag complement is present in multiple copies, is described by the Poisson
distribution, as indicated in the following table.



Table VII

Number of             Fraction of         Fraction of           Fraction of
microparticles in     repertoire of tag   microparticles in     microparticles in
sample (as fraction   complements         sample with unique    sample carrying same
of repertoire size),  present in sample,  tag complement,       tag complement as one
m                     1 - e^-m            m(e^-m)               other microparticle in
                                                                sample ("bead doubles"),
                                                                m^2(e^-m)/2

1.000                 0.63                0.37                  0.18
.693                  0.50                0.35                  0.12
.405                  0.33                0.27                  0.05
.285                  0.25                0.21                  0.03
.223                  0.20                0.18                  0.02
.105                  0.10                0.09                  0.005
.010                  0.01                0.01
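
The tabulated values can be reproduced from the Poisson distribution with mean m. The following sketch is an illustration, not part of the disclosure; the function name is hypothetical, and the three expressions are the zero-complement, one-copy and two-copy Poisson terms that match the printed numbers.

```python
import math


def poisson_fractions(m):
    """Return the three Table VII quantities for a sample of relative size m.

    present : probability a given tag complement occurs at least once
    unique  : Poisson probability of exactly one copy
    doubles : Poisson probability of exactly two copies ("bead doubles")
    """
    present = 1 - math.exp(-m)
    unique = m * math.exp(-m)
    doubles = m ** 2 * math.exp(-m) / 2
    return present, unique, doubles


if __name__ == "__main__":
    for m in (1.0, 0.693, 0.405, 0.285, 0.223, 0.105, 0.010):
        present, unique, doubles = poisson_fractions(m)
        print(f"{m:5.3f}  {present:.2f}  {unique:.2f}  {doubles:.3f}")
```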





High Specificity Sorting and Panning
The kinetics of sorting depends on the rate of hybridization of oligonucleotide
tags to their tag complements which, in turn, depends on the complexity of the tags in
the hybridization reaction. Thus, a trade off exists between sorting rate and tag
complexity, such that an increase in sorting rate may be achieved at the cost of
reducing the complexity of the tags involved in the hybridization reaction. As
explained below, the effects of this trade off may be ameliorated by "panning."

Specificity of the hybridizations may be increased by taking a sufficiently
small sample so that both a high percentage of tags in the sample are unique and the
nearest neighbors of substantially all the tags in a sample differ by at least two words.
This latter condition may be met by taking a sample that contains a number of
tag-polynucleotide conjugates that is about 0.1 percent or less of the size of the
repertoire being employed. For example, if tags are constructed with eight words
selected from Table II, a repertoire of 8^8, or about 1.67 x 10^7, tags and tag
complements are produced. In a library of tag-cDNA conjugates as described above, a
0.1 percent sample means that about 16,700 different tags are present. If this were
loaded directly onto a repertoire-equivalent of microparticles, or in this example a
sample of 1.67 x 10^7 microparticles, then only a sparse subset of the sampled
microparticles would be loaded. The density of loaded microparticles can be
increased, for example for more efficient sequencing, by undertaking a "panning"
step in which the sampled tag-cDNA conjugates are used to separate loaded
microparticles from unloaded microparticles. Thus, in the example above, even
though a "0.1 percent" sample contains only 16,700 cDNAs, the sampling and
panning steps may be repeated until as many loaded microparticles as desired are
accumulated.

A panning step may be implemented by providing a sample of tag-cDNA
conjugates each of which contains a capture moiety at an end opposite, or distal to,
the oligonucleotide tag. Preferably, the capture moiety is of a type which can be
released from the tag-cDNA conjugates, so that the tag-cDNA conjugates can be
sequenced with a single-base sequencing method. Such moieties may comprise
biotin, digoxigenin, or like ligands, a triplex binding region, or the like. Preferably,
such a capture moiety comprises a biotin component. Biotin may be attached to
tag-cDNA conjugates by a number of standard techniques. If appropriate adapters
containing PCR primer binding sites are attached to tag-cDNA conjugates, biotin may
be attached by using a biotinylated primer in an amplification after sampling.
Alternatively, if the tag-cDNA conjugates are inserts of cloning vectors, biotin may be
attached after excising the tag-cDNA conjugates by digestion with an appropriate
restriction enzyme followed by isolation and filling in a protruding strand distal to the
tags with a DNA polymerase in the presence of biotinylated uridine triphosphate.

After a tag-cDNA conjugate is captured, it may be released from the biotin
moiety in a number of ways, such as by a chemical linkage that is cleaved by
reduction, e.g. Herman et al, Anal. Biochem., 156: 48-55 (1986), or that is cleaved
photochemically, e.g. Olejnik et al, Nucleic Acids Research, 24: 361-366 (1996), or
that is cleaved enzymatically by introducing a restriction site in the PCR primer. The
latter embodiment can be exemplified by considering the library of tag-polynucleotide
conjugates described above:

5'-RCGACCA[C,W,W,W]9GG[T]19- cDNA -NNNR
       GGT[G,W,W,W]9CC[A]19- rDNA -NNNYCTAG-5'

The following adapters may be ligated to the ends of these fragments to permit
amplification by PCR:

5'-XXXXXXXXXXXXXXXXXXXX
   XXXXXXXXXXXXXXXXXXXXYGAT

        Right Adapter


   GATCZZACTAGTZZZZZZZZZZZZ-3'
       ZZTGATCAZZZZZZZZZZZZ

        Left Adapter


       ZZTGATCAZZZZZZZZZZZZ-5'-biotin

        Left Primer

where "ACTAGT" is a Spe I recognition site (which leaves a staggered cleavage
ready for single base sequencing), and the X's and Z's are nucleotides selected so that
the annealing and dissociation temperatures of the respective primers are
approximately the same. After ligation of the adapters and amplification by PCR
using the biotinylated primer, the tags of the conjugates are rendered single stranded
by the exonuclease activity of T4 DNA polymerase and conjugates are combined with
a sample of microparticles, e.g. a repertoire equivalent, with tag complements
attached. After annealing under stringent conditions (to minimize mis-attachment of
tags), the conjugates are preferably ligated to their tag complements and the loaded
microparticles are separated from the unloaded microparticles by capture with
avidinated magnetic beads, or like capture technique.

Returning to the example, this process results in the accumulation of about
10,500 (=16,700 x .63) loaded microparticles with different tags, which may be
released from the magnetic beads by cleavage with Spe I. By repeating this process
40-50 times with new samples of microparticles and tag-cDNA conjugates, 4-5 x 10^5
cDNAs can be accumulated by pooling the released microparticles. The pooled
microparticles may then be simultaneously sequenced by a single-base sequencing
technique.
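
The arithmetic of the running example can be laid out explicitly. The sketch below is a hedged illustration: the function and parameter names are hypothetical, and it assumes that the factor .63 is the m = 1 entry of Table VII (a repertoire-equivalent of microparticles, so that a fraction 1 - e^-1 of tag complements is present). The text rounds the intermediate values, so the computed figures differ slightly from 16,700 and 10,500.

```python
import math


def panning_plan(words_per_tag=8, word_choices=8,
                 sample_fraction=0.001, target_cdnas=4.5e5):
    """Estimate loaded microparticles per round and rounds needed for a target."""
    repertoire = word_choices ** words_per_tag        # 8^8 tags
    sampled_tags = repertoire * sample_fraction       # 0.1 percent sample
    fraction_present = 1 - math.exp(-1.0)             # m = 1, about 0.63
    loaded_per_round = sampled_tags * fraction_present
    rounds = math.ceil(target_cdnas / loaded_per_round)
    return repertoire, sampled_tags, loaded_per_round, rounds


if __name__ == "__main__":
    repertoire, sampled, loaded, rounds = panning_plan()
    print(f"repertoire        {repertoire:,}")
    print(f"tags in sample    {sampled:,.0f}")
    print(f"loaded per round  {loaded:,.0f}")
    print(f"rounds for target {rounds}")
```

The number of rounds falls in the 40-50 range stated above for any target between roughly 4 x 10^5 and 5 x 10^5 cDNAs.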

Determining how many times to repeat the sampling and panning steps, or
more generally, determining how many cDNAs to analyze, depends on one's
objective. If the objective is to monitor the changes in abundance of relatively
common sequences, e.g. making up 5% or more of a population, then relatively small
samples, i.e. a small fraction of the total population size, may allow statistically
significant estimates of relative abundances. On the other hand, if one seeks to
monitor the abundances of rare sequences, e.g. making up 0.1% or less of a
population, then large samples are required. Generally, there is a direct relationship
between sample size and the reliability of the estimates of relative abundances based
on the sample. There is extensive guidance in the literature on determining
appropriate sample sizes for making reliable statistical estimates, e.g. Roller et al,
Nucleic Acids Research, 23:185-191 (1994); Good, Biometrika, 40: 16-264 (1953);
Bunge et al, J. Am. Stat. Assoc., 88: 364-373 (1993); and the like. Preferably, for
monitoring changes in gene expression based on the analysis of a series of cDNA
libraries containing 10^ to 10^ independent clones of 3.0-3.5 x different
sequences, a sample of at least 10^ sequences is accumulated for analysis of each
library. More preferably, a sample of at least 10^ sequences is accumulated for the
analysis of each library; and most preferably, a sample of at least 5 x 10^ sequences
is accumulated for the analysis of each library. Alternatively, the number of
sequences sampled is preferably sufficient to estimate the relative abundance of a
sequence present at a frequency within the range of 0.1% to 5% with a 95%
confidence limit no larger than 0.1% of the population size.
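
One way to read the last criterion is as a binomial sampling problem. The sketch below is an illustration rather than the method of the cited references: it uses the normal approximation to the binomial and assumes that the "confidence limit" means the half-width of a two-sided 95% interval, expressed as a fraction of the population. The function name and that interpretation are assumptions.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def sequences_needed(frequency, half_width, z=Z_95):
    """Sample size giving the requested half-width for an estimated frequency."""
    variance_term = frequency * (1 - frequency)
    return math.ceil(z ** 2 * variance_term / half_width ** 2)


if __name__ == "__main__":
    for frequency in (0.001, 0.01, 0.05):
        n = sequences_needed(frequency, half_width=0.001)
        print(f"frequency {frequency:.3f}: about {n:,} sequences")
```

Under these assumptions the most demanding case is the 5% end of the range, which calls for a sample on the order of 10^5 sequences, while the 0.1% end needs only a few thousand.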

Single Base DNA Sequencing
The present invention can be employed with conventional methods of DNA
sequencing, e.g. as disclosed by Hultman et al, Nucleic Acids Research, 17: 4937-
4946 (1989). However, for parallel, or simultaneous, sequencing of multiple
polynucleotides, a DNA sequencing methodology is preferred that requires neither
electrophoretic separation of closely sized DNA fragments nor analysis of cleaved
nucleotides by a separate analytical procedure, as in peptide sequencing. Preferably,
the methodology permits the stepwise identification of nucleotides, usually one at a
time, in a sequence through successive cycles of treatment and detection. Such
methodologies are referred to herein as "single base" sequencing methods. Single
base approaches are disclosed in the following references: Cheeseman, U.S. patent
5,302,509; Tsien et al, International application WO 91/06678; Rosenthal et al,
International application WO 93/21340; Canard et al, Gene, 148: 1-6 (1994); and
Metzker et al, Nucleic Acids Research, 22: 4259-4267 (1994).

A "single base" method of DNA sequencing which is suitable for use with the
present invention and which requires no electrophoretic separation of DNA fragments
is described in International application PCT/US95/03678. Briefly, the method
comprises the following steps: (a) ligating a probe to an end of the polynucleotide
having a protruding strand to form a ligated complex, the probe having a
complementary protruding strand to that of the polynucleotide and the probe having a
nuclease recognition site; (b) removing unligated probe from the ligated complex; (c)
identifying one or more nucleotides in the protruding strand of the polynucleotide by
the identity of the ligated probe; (d) cleaving the ligated complex with a nuclease; and
(e) repeating steps (a) through (d) until the nucleotide sequence of the polynucleotide,
or a portion thereof, is determined.

A single signal generating moiety, such as a single fluorescent dye, may be
employed when sequencing several different target polynucleotides attached to
different spatially addressable solid phase supports, such as fixed microparticles, in a
parallel sequencing operation. This may be accomplished by providing four sets of
probes that are applied sequentially to the plurality of target polynucleotides on the
different microparticles. An exemplary set of such probes is shown below:



Set 1               Set 2               Set 3               Set 4

ANNNN...NN          dANNNN...NN         dANNNN...NN         dANNNN...NN
    N...NNTT...T*       N...NNTT...T        N...NNTT...T        N...NNTT...T

dCNNNN...NN         CNNNN...NN          dCNNNN...NN         dCNNNN...NN
    N...NNTT...T        N...NNTT...T*       N...NNTT...T        N...NNTT...T

dGNNNN...NN         dGNNNN...NN         GNNNN...NN          dGNNNN...NN
    N...NNTT...T        N...NNTT...T        N...NNTT...T*       N...NNTT...T

dTNNNN...NN         dTNNNN...NN         dTNNNN...NN         TNNNN...NN
    N...NNTT...T        N...NNTT...T        N...NNTT...T        N...NNTT...T*

where each of the listed probes represents a mixture of 4^3 = 64 oligonucleotides such
that the identity of the 3' terminal nucleotide of the top strand is fixed and the other
positions in the protruding strand are filled by every 3-mer permutation of nucleotides,
or complexity reducing analogs. The listed probes are also shown with a single
stranded poly-T tail with a signal generating moiety attached to the terminal thymidine,
shown as "T*". The "d" on the unlabeled probes designates a ligation-blocking moiety
or absence of 3'-hydroxyl, which prevents unlabeled probes from being ligated.
Preferably, such 3'-terminal nucleotides are dideoxynucleotides. In this embodiment,
the probes of set 1 are first applied to the plurality of target polynucleotides and treated
with a ligase so that target polynucleotides having a thymidine complementary to the 3'
terminal adenosine of the labeled probes are ligated. The unlabeled probes are
simultaneously applied to minimize inappropriate ligations. The locations of the target
polynucleotides that form ligated complexes with probes terminating in "A" are
identified by the signal generated by the label carried on the probe. After washing and
cleavage, the probes of set 2 are applied. In this case, target polynucleotides forming
ligated complexes with probes terminating in "C" are identified by location. Similarly,
the probes of sets 3 and 4 are applied and locations of positive signals identified. This
process of sequentially applying the four sets of probes continues until the desired
number of nucleotides is identified on the target polynucleotides. Clearly, one of
ordinary skill could construct similar sets of probes that could have many variations,
such as having protruding strands of different lengths, different moieties to block
ligation of unlabeled probes, different means for labeling probes, and the like.
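
The logic of reading four bases with a single label can be sketched as a toy simulation. This is only an illustration under idealized assumptions (perfect ligation specificity, no signal carry-over between sets); the function name, the data layout and the example positions are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# 3'-terminal nucleotide of the labeled probe in sets 1 through 4.
LABELED_PROBE_BY_SET = {1: "A", 2: "C", 3: "G", 4: "T"}


def call_bases(targets):
    """Identify the interrogated base at each microparticle location.

    targets maps a location (any hashable value) to the target nucleotide
    being interrogated there. For each probe set, only locations whose
    target base is complementary to the labeled probe give a signal.
    Returns a dict mapping location -> (set number, called target base).
    """
    calls = {}
    for set_number, probe_base in LABELED_PROBE_BY_SET.items():
        called_base = COMPLEMENT[probe_base]
        for location, target_base in targets.items():
            if target_base == called_base:
                calls[location] = (set_number, called_base)
    return calls


if __name__ == "__main__":
    # Hypothetical x-y locations of three fixed microparticles.
    beads = {(0, 0): "T", (0, 1): "G", (1, 0): "A"}
    for location, (set_number, base) in sorted(call_bases(beads).items()):
        print(location, "signal in set", set_number, "-> target base", base)
```

Each location gives a signal in exactly one of the four sets, so the set number alone identifies the base and a single dye suffices.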






Apparatus for Sequencing Populations of Polynucleotides
An objective of the invention is to sort identical molecules, particularly
polynucleotides, onto the surfaces of microparticles by the specific hybridization of
tags and their complements. Once such sorting has taken place, the presence of the
molecules or operations performed on them can be detected in a number of ways
depending on the nature of the tagged molecule, whether microparticles are detected
separately or in "batches," whether repeated measurements are desired, and the like.
Typically, the sorted molecules are exposed to ligands for binding, e.g. in drug
development, or are subjected to chemical or enzymatic processes, e.g. in
polynucleotide sequencing. In both of these uses it is often desirable to simultaneously
observe signals corresponding to such events or processes on large numbers of
microparticles. Microparticles carrying sorted molecules (referred to herein as
"loaded" microparticles) lend themselves to such large scale parallel operations, e.g.
as demonstrated by Lam et al (cited above).

Preferably, whenever light-generating signals, e.g. chemiluminescent,
fluorescent, or the like, are employed to detect events or processes, loaded
microparticles are spread on a planar substrate, e.g. a glass slide, for examination with
a scanning system, such as described in International patent applications
PCT/US91/09217, PCT/NL90/00081, and PCT/US95/01886. The scanning system
should be able to reproducibly scan the substrate and to define the positions of each
microparticle in a predetermined region by way of a coordinate system. In
polynucleotide sequencing applications, it is important that the positional
identification of microparticles be repeatable in successive scan steps.

Such scanning systems may be constructed from commercially available
components, e.g. x-y translation table controlled by a digital computer used with a
detection system comprising one or more photomultiplier tubes, or alternatively, a
CCD array, and appropriate optics, e.g. for exciting, collecting, and sorting
fluorescent signals. In some embodiments a confocal optical system may be
desirable. An exemplary scanning system suitable for use in four-color sequencing is
illustrated diagrammatically in Figure 5. Substrate 300, e.g. a microscope slide with
fixed microparticles, is placed on x-y translation table 302, which is connected to and
controlled by an appropriately programmed digital computer 304 which may be any of
a variety of commercially available personal computers, e.g. 486-based machines or
PowerPC model 7100 or 8100 available from Apple Computer (Cupertino, CA).
Computer software for table translation and data collection functions can be provided
by commercially available laboratory software, such as Lab Windows, available from
National Instruments.




Substrate 300 and table 302 are operationally associated with microscope 306
having one or more objective lenses 308 which are capable of collecting and
delivering light to microparticles fixed to substrate 300. Excitation beam 310 from
light source 312, which is preferably a laser, is directed to beam splitter 314, e.g. a
dichroic mirror, which re-directs the beam through microscope 306 and objective lens
308 which, in turn, focuses the beam onto substrate 300. Lens 308 collects
fluorescence 316 emitted from the microparticles and directs it through beam splitter
314 to signal distribution optics 318 which, in turn, directs fluorescence to one or
more suitable opto-electronic devices for converting some fluorescence characteristic,
e.g. intensity, lifetime, or the like, to an electrical signal. Signal distribution optics
318 may comprise a variety of components standard in the art, such as bandpass
filters, fiber optics, rotating mirrors, fixed position mirrors and lenses, diffraction
gratings, and the like. As illustrated in Figure 2, signal distribution optics 318 directs
fluorescence 316 to four separate photomultiplier tubes, 330, 332, 334, and 336,
whose output is then directed to pre-amps and photon counters 350, 352, 354, and
356. The output of the photon counters is collected by computer 304, where it can be
stored, analyzed, and viewed on video 360. Alternatively, signal distribution optics
318 could be a diffraction grating which directs fluorescent signal 318 onto a CCD
array.

The stability and reproducibility of the positional localization in scanning will
determine, to a large extent, the resolution for separating closely spaced
microparticles. Preferably, the scanning systems should be capable of resolving
closely spaced microparticles, e.g. separated by a particle diameter or less. Thus, for
most applications, e.g. using CPG microparticles, the scanning system should at least
have the capability of resolving objects on the order of 10-100 µm. Even higher
resolution may be desirable in some embodiments, but with increased resolution, the
time required to fully scan a substrate will increase; thus, in some embodiments a
compromise may have to be made between speed and resolution. Reductions in
scanning time can be achieved by a system which only scans positions where
microparticles are known to be located, e.g. from an initial full scan. Preferably,
microparticle size and scanning system resolution are selected to permit resolution of
fluorescently labeled microparticles randomly disposed on a plane at a density
between about ten thousand and one hundred thousand microparticles per cm^2.

In sequencing applications, loaded microparticles can be fixed to the surface
of a substrate in a variety of ways. The fixation should be strong enough to allow the
microparticles to undergo successive cycles of reagent exposure and washing without
significant loss. When the substrate is glass, its surface may be derivatized with an
alkylamino linker using commercially available reagents, e.g. Pierce Chemical, which
in turn may be cross-linked to avidin, again using conventional chemistries, to form
an avidinated surface. Biotin moieties can be introduced to the loaded microparticles
in a number of ways. For example, a fraction, e.g. 10-15 percent, of the cloning
vectors used to attach tags to polynucleotides are engineered to contain a unique
restriction site (providing sticky ends on digestion) immediately adjacent to the
polynucleotide insert at an end of the polynucleotide opposite of the tag. The site is
excised with the polynucleotide and tag for loading onto microparticles. After
loading, about 10-15 percent of the loaded polynucleotides will possess the unique
restriction site distal from the microparticle surface. After digestion with the
associated restriction endonuclease, an appropriate double stranded adaptor
containing a biotin moiety is ligated to the sticky end. The resulting microparticles
are then spread on the avidinated glass surface where they become fixed via the
biotin-avidin linkages.

Alternatively and preferably when sequencing by ligation is employed, in the
initial ligation step a mixture of probes is applied to the loaded microparticle; a
fraction of the probes contain a type IIs restriction recognition site, as required by the
sequencing method, and a fraction of the probes have no such recognition site, but
instead contain a biotin moiety at their non-ligating end. Preferably, the mixture
comprises about 10-15 percent of the biotinylated probe.

In still another alternative, when DNA-loaded microparticles are applied to a
glass substrate, the DNA may nonspecifically adsorb to the glass surface upon several
hours, e.g. 24 hours, incubation to create a bond sufficiently strong to permit repeated
exposures to reagents and washes without significant loss of microparticles.
Preferably, such a glass substrate is a flow cell, which may comprise a channel etched
in a glass slide. Preferably, such a channel is closed so that fluids may be pumped
through it and has a depth sufficiently close to the diameter of the microparticles so
that a monolayer of microparticles is trapped within a defined observation region.

Identification of Novel Polynucleotides
in cDNA Libraries

Novel polynucleotides in a cDNA library can be identified by constructing a
library of cDNA molecules attached to microparticles, as described above. A large
fraction of the library, or even the entire library, can then be partially sequenced in
parallel. After isolation of mRNA, and perhaps normalization of the population as
taught by Soares et al, Proc. Natl. Acad. Sci., 91: 9228-9232 (1994), or like
references, the following primer may be hybridized to the polyA tails for first strand
synthesis with a reverse transcriptase using conventional protocols (SEQ ID NO: 1):






5'-mRNA- [A]n -3'
         [T]19-[primer site]-GG[W,W,W,C]9ACCAGCTGATC-5'

where [W,W,W,C]9 represents a tag as described above, "ACCAGCTGATC" is an
optional sequence forming a restriction site in double stranded form, and "primer site"
is a sequence common to all members of the library that is later used as a primer
binding site for amplifying polynucleotides of interest by PCR.

After reverse transcription and second strand synthesis by conventional 
techniques, the double stranded fragments are inserted into a cloning vector as 
described above and amplified. The amplified library is then sampled and the sample 
amplified. The cloning vectors from the amplified sample are isolated, and the tagged 
cDNA fragments excised and purified. After rendering the tag single stranded with a 
polymerase as described above, the fragments are methylated and sorted onto 
microparticles in accordance with the invention. Preferably, as described above, the 
cloning vector is constructed so that the tagged cDNAs can be excised with an 
endonuclease, such as Fok I, that will allow immediate sequencing by the preferred 
single base method after sorting and ligation to microparticles. 

Stepwise sequencing is then carried out simultaneously on the whole library, 
or one or more large fractions of the library, in accordance with the invention until a 
sufficient number of nucleotides are identified on each cDNA for unique 
representation in the genome of the organism from which the library is derived. For 
example, if the library is derived from mammalian mRNA then a randomly selected
sequence 14-15 nucleotides long is expected to have unique representation among the 
2-3 thousand megabases of the typical mammalian genome. Of course identification 
of far fewer nucleotides would be sufficient for unique representation in a library 
derived from bacteria, or other lower organisms. Preferably, at least 20-30 
nucleotides are identified to ensure unique representation and to permit construction 
of a suitable primer as described below. The tabulated sequences may then be 
compared to known sequences to identify unique cDNAs. 

Unique cDNAs are then isolated by conventional techniques, e.g. constructing
a probe from the PCR amplicon produced with primers directed to the primer site and
the portion of the cDNA whose sequence was determined. The probe may then be
used to identify the cDNA in a library using a conventional screening protocol.

The above method for identifying new cDNAs may also be used to fingerprint 
mRNA populations, either in isolated measurements or in the context of a 
dynamically changing population. Partial sequence information is obtained 
simultaneously from a large sample, e.g. ten to a hundred thousand, or more, of 
cDNAs attached to separate microparticles as described in the above method. 






Example 1 

Construction of a Tag Library 
An exemplary tag library is constructed as follows to form the chemically
synthesized 9-word tags of nucleotides A, G, and T defined by the formula:

3'-TGGC-[⁴(A,G,T)₉]-CCCCp

where "[⁴(A,G,T)₉]" indicates a tag mixture where each tag consists of nine 4-mer
words of A, G, and T; and "p" indicates a 5' phosphate. This mixture is ligated to the
following right and left primer binding regions (SEQ ID NO: 4 and SEQ ID NO: 5):






5'- AGTGGCTGGGCATCGGACCG          5'- GGGGCCCAGTCAGCGTCGAT
    TCACCGACCCGTAGCCp                     GGGTCAGTCGCAGCTA

            LEFT                               RIGHT



The right and left primer binding regions are ligated to the above tag mixture, after
which the single stranded portion of the ligated structure is filled with DNA
polymerase, then mixed with the right and left primers indicated below and amplified
to give a tag library (SEQ ID NO: 6).



Left Primer
5'- AGTGGCTGGGCATCGGACCG

5'- AGTGGCTGGGCATCGGACCG-[⁴(A,G,T)₉]-GGGGCCCAGTCAGCGTCGAT
    TCACCGACCCGTAGCCTGGC-[⁴(A,G,T)₉]-CCCCGGGTCAGTCGCAGCTA

                                     CCCCGGGTCAGTCGCAGCTA -5'
                                     Right Primer



The underlined portion of the left primer binding region indicates a Rsr II recognition
site. The left-most underlined region of the right primer binding region indicates
recognition sites for Bsp 120I, Apa I, and Eco O109I, and a cleavage site for Hga I.
The right-most underlined region of the right primer binding region indicates the
recognition site for Hga I. Optionally, the right or left primers may be synthesized
with a biotin attached (using conventional reagents, e.g. available from Clontech
Laboratories, Palo Alto, CA) to facilitate purification after amplification and/or
cleavage.



[Page 40: NOT FURNISHED UPON FILING]



           primer binding site         Ppu MI site
                    |                       |
(plasmid)-5'-AAAAGGAGGAGGCCTTGATAGAGAGGACCT-

             -CAAATTTG-CCTAGG-AGAAGGAGAAGGAGAAGG-
                  |       |
                  |   Bam HI site
             Pme I site

The plasmid is cleaved with Ppu MI and Pme I (to give a Rsr II-compatible end and a
flush end so that the insert is oriented) and then methylated with DAM methylase.
The tag-containing construct is cleaved with Rsr II and then ligated to the open
plasmid, after which the conjugate is cleaved with Mbo I and Bam HI to permit
ligation and closing of the plasmid. The plasmid is then amplified and isolated and
used in accordance with the invention.



Example 3

Changes in Gene Expression Profiles in Liver Tissue of Rats
Exposed to Various Xenobiotic Agents

In this experiment, to test the capability of the method of the invention to
detect genes induced as a result of exposure to xenobiotic compounds, the gene
expression profile of rat liver tissue is examined following administration of several
compounds known to induce the expression of cytochrome P-450 isoenzymes. The
results obtained from the method of the invention are compared to results obtained
from reverse transcriptase PCR measurements and immunochemical measurements of
the cytochrome P-450 isoenzymes. Protocols and materials for the latter assays are
described in Morris et al, Biochemical Pharmacology, 52: 781-792 (1996).

Male Sprague-Dawley rats between the ages of 6 and 8 weeks and weighing
200-300 g are used, and food and water are available to the animals ad lib. Test
compounds are phenobarbital (PB), metyrapone (MET), dexamethasone (DEX),
clofibrate (CLO), corn oil (CO), and β-naphthoflavone (BNF), and are available from
Sigma Chemical Co. (St. Louis, MO). Antibodies against specific P-450 enzymes are
available from the following sources: rabbit anti-rat CYP3A1 from Human Biologics,
Inc. (Phoenix, AZ); goat anti-rat CYP4A1 from Daiichi Pure Chemicals Co. (Tokyo,
Japan); monoclonal mouse anti-rat CYP1A1, monoclonal mouse anti-rat CYP2C11,
goat anti-rat CYP2E1, and monoclonal mouse anti-rat CYP2B1 from Oxford
Biochemical Research, Inc. (Oxford, MI). Secondary antibodies (goat anti-rabbit IgG,
rabbit anti-goat IgG and goat anti-mouse IgG) are available from Jackson
ImmunoResearch Laboratories (West Grove, PA).

Animals are administered either PB (100 mg/kg), BNF (100 mg/kg), MET
(100 mg/kg), DEX (100 mg/kg), or CLO (250 mg/kg) for 4 consecutive days via
intraperitoneal injection following a dosing regimen similar to that described by
Wang et al, Arch. Biochem. Biophys. 290: 355-361 (1991). Animals treated with
H2O and CO are used as controls. Two hours following the last injection (day 4),
animals are killed, and the livers are removed. Livers are immediately frozen and
stored at -70°C.

Total RNA is prepared from frozen liver tissue using a modification of the
method described by Xie et al, Biotechniques, 11: 326-327 (1991). Approximately
100-200 mg of liver tissue is homogenized in the RNA extraction buffer described by
Xie et al to isolate total RNA. The resulting RNA is reconstituted in
diethylpyrocarbonate-treated water, quantified spectrophotometrically at 260 nm, and
adjusted to a concentration of 100 µg/ml. Total RNA is stored in
diethylpyrocarbonate-treated water for up to 1 year at -70°C without any apparent
degradation. RT-PCR and sequencing are performed on samples from these
preparations.

For sequencing, samples of RNA corresponding to about 0.5 µg of poly(A)+
RNA are used to construct libraries of tag-cDNA conjugates following the protocol
described in the section entitled "Attaching Tags to Polynucleotides for Sorting onto
Solid Phase Supports," with the following exception: the tag repertoire is constructed
from six 4-nucleotide words from Table II. Thus, the complexity of the repertoire is
8^6 or about 2.6 x 10^5. For each tag-cDNA conjugate library constructed, ten samples
of about ten thousand clones are taken for amplification and sorting. Each of the
amplified samples is separately applied to a fixed monolayer of about 10^6 10 µm
diameter GMA beads containing tag complements. That is, the "sample" of tag
complements in the GMA bead population on each monolayer is about four fold the
total size of the repertoire, thus ensuring there is a high probability that each of the
sampled tag-cDNA conjugates will find its tag complement on the monolayer. After
the oligonucleotide tags of the amplified samples are rendered single stranded as
described above, the tag-cDNA conjugates of the samples are separately applied to the
monolayers under conditions that permit specific hybridization only between
oligonucleotide tags and tag complements forming perfectly matched duplexes.
Concentrations of the amplified samples and hybridization times are selected to
permit the loading of about 5 x 10^ to 2 x 10^ tag-cDNA conjugates on each bead
where perfect matches occur. After ligation, 9-12 nucleotide portions of the attached
cDNAs are determined in parallel by the single base sequencing technique described
by Brenner in International patent application PCT/US95/03678. Frequency
distributions for the gene expression profiles are assembled from the sequence
information obtained from each of the ten samples.
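
The numbers in this paragraph can be checked with a short calculation. The sketch below reproduces the repertoire size (8 words at each of 6 positions) and, under an assumption that is not stated in the text, namely that tags and bead complements are drawn uniformly at random from the repertoire, estimates how often a sampled conjugate carries a tag shared by no other conjugate in the sample and how likely a given tag is to have at least one complementary bead on the monolayer. The function names are hypothetical.

```python
def repertoire_size(words_per_position: int, positions: int) -> int:
    """Number of distinct tags built from a fixed number of word positions."""
    return words_per_position ** positions


def fraction_with_unique_tag(sample_size: int, repertoire: int) -> float:
    """Expected fraction of sampled conjugates whose tag no other sampled conjugate
    shares (assumes uniformly random tags)."""
    return (1.0 - 1.0 / repertoire) ** (sample_size - 1)


def prob_complement_present(beads: int, repertoire: int) -> float:
    """Probability that at least one bead carries the complement of a given tag
    (assumes uniformly random bead complements)."""
    return 1.0 - (1.0 - 1.0 / repertoire) ** beads


if __name__ == "__main__":
    size = repertoire_size(8, 6)
    print("repertoire size:", size)
    print("beads per repertoire member:", 10**6 / size)
    print("unique-tag fraction for 10,000 clones:", fraction_with_unique_tag(10_000, size))
    print("P(complement on monolayer):", prob_complement_present(10**6, size))
```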

RT-PCRs of selected mRNAs corresponding to cytochrome P-450 genes and
the constitutively expressed cyclophilin gene are carried out as described in Morris et
al (cited above). Briefly, a 20 µL reaction mixture is prepared containing 1x reverse
transcriptase buffer (Gibco BRL), 10 nM dithiothreitol, 0.5 nM dNTPs, 2.5 oligo
d(T)15 primer, 40 units RNasin (Promega, Madison, WI), 200 units RNase H- reverse
transcriptase (Gibco BRL), and 400 ng of total RNA (in diethylpyrocarbonate-treated
water). The reaction is incubated for 1 hour at 37°C followed by inactivation of the
enzyme at 95°C for 5 min. The resulting cDNA is stored at -20°C until used. For
PCR amplification of cDNA, a 10 nL reaction mixture is prepared containing 10x
polymerase reaction buffer, 2 mM MgCl2, 1 unit Taq DNA polymerase (Perkin-Elmer,
Norwalk, CT), 20 ng cDNA, and 200 nM concentration of the 5' and 3'
specific PCR primers of the sequences described in Morris et al (cited above). PCRs
are carried out in a Perkin-Elmer 9600 thermal cycler for 23 cycles using melting,
annealing, and extension conditions of 94°C for 30 sec, 56°C for 1 min, and 72°C
for 1 min, respectively. Amplified cDNA products are separated by PAGE using 5%
native gels. Bands are detected by staining with ethidium bromide.

Western blots of the liver proteins are carried out using standard protocols
after separation by SDS-PAGE. Briefly, proteins are separated on 10% SDS-PAGE
gels under reducing conditions and immunoblotted for detection of P-450 isoenzymes
using a modification of the methods described in Harris et al, Proc. Natl. Acad. Sci.,
88: 1407-1410 (1991). Proteins are loaded at 50 µg/lane and resolved under constant
current (250 V) for approximately 4 hours at 2°C. Proteins are transferred to
nitrocellulose membranes (Bio-Rad, Hercules, CA) in 15 mM Tris buffer containing
120 mM glycine and 20% (v/v) methanol. The nitrocellulose membranes are blocked
with 2.5% BSA and immunoblotted for P-450 isoenzymes using primary monoclonal
and polyclonal antibodies and secondary alkaline phosphatase conjugated anti-IgG.
Immunoblots are developed with the Bio-Rad alkaline phosphatase substrate kit.

The three types of measurements of P-450 isoenzyme induction showed
substantial agreement.






APPENDIX la 

Exemplary computer program for generating
minimally cross hybridizing sets 
(single stranded tag/single stranded tag complement) 

      Program minxh
c
c
c
      integer*2 sub1(6),mset1(1000,6),mset2(1000,6)
      dimension nbase(6)
c
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
  100 format(i1)
      open(1,file='sub4.dat',form='formatted',status='new')
c
c
      nset=0
      do 7000 m1=1,3
      do 7000 m2=1,3
      do 7000 m3=1,3
      do 7000 m4=1,3
      sub1(1)=m1
      sub1(2)=m2
      sub1(3)=m3
      sub1(4)=m4
c
c
      ndiff=3
c
c
c     Generate set of subunits differing from
c     sub1 by at least ndiff nucleotides.
c     Save in mset1.
c
c     jj counts the subunits stored in mset1; sub1 is entry 1.
c
      jj=1
      do 900 j=1,nsub
  900 mset1(1,j)=sub1(j)
c
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
c
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
c
      n=0
      do 1200 j=1,nsub
      if(sub1(j).eq.1 .and. nbase(j).ne.1 .or.
     1   sub1(j).eq.2 .and. nbase(j).ne.2 .or.
     2   sub1(j).eq.3 .and. nbase(j).ne.3) then
         n=n+1
      endif
 1200 continue
c
c
      if(n.ge.ndiff) then
c
c     If number of mismatches
c     is greater than or equal
c     to ndiff then record
c     subunit in matrix mset
c
         jj=jj+1
         do 1100 i=1,nsub
 1100    mset1(jj,i)=nbase(i)
      endif
c
c
 1000 continue
c
c
      do 1325 j2=1,nsub
      mset2(1,j2)=mset1(1,j2)
 1325 mset2(2,j2)=mset1(2,j2)
c
c
c     Compare subunit 2 from
c     mset1 with each successive
c     subunit in mset1, i.e. 3,
c     4,5, ... etc. Save those
c     with mismatches .ge. ndiff
c     in matrix mset2 starting at
c     position 2.
c
c     Next transfer contents
c     of mset2 into mset1 and
c     start
c     comparisons again this time
c     starting with subunit 3.
c     Continue until all subunits
c     undergo the comparisons.
c
c
      npass=0
c
c
 1700 continue
      kk=npass+2
      npass=npass+1
c
c
      do 1500 m=npass+2,jj
      n=0
      do 1600 j=1,nsub
      if(mset1(npass+1,j).eq.1.and.mset1(m,j).ne.1.or.
     2   mset1(npass+1,j).eq.2.and.mset1(m,j).ne.2.or.
     2   mset1(npass+1,j).eq.3.and.mset1(m,j).ne.3) then
         n=n+1
      endif
 1600 continue
      if(n.ge.ndiff) then
         kk=kk+1
         do 1625 i=1,nsub
 1625    mset2(kk,i)=mset1(m,i)
      endif
 1500 continue
c
c     kk is the number of subunits
c     stored in mset2
c
c     Transfer contents of mset2
c     into mset1 for next pass.
c
c
      do 2000 k=1,kk
      do 2000 m=1,nsub
 2000 mset1(k,m)=mset2(k,m)
      if(kk.lt.jj) then
         jj=kk
         goto 1700
      endif
c
c
      nset=nset+1
      write(1,7009)
 7009 format(/)
      do 7008 k=1,kk
 7008 write(1,7010)(mset1(k,m),m=1,nsub)
 7010 format(4i1)
      write(*,*)
      write(*,120) kk,nset
  120 format(1x,'Subunits in set=',i5,2x,'Set No=',i5)
 7000 continue
      close(1)
c
c
      end
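
For readers who do not work in Fortran, the pass-by-pass filtering described in the program comments can be sketched in Python. This is an illustrative re-expression under stated assumptions, not a line-for-line port: the letters A, G, T stand in for the codes 1, 2, 3, the function names are hypothetical, and the loop simply continues until every retained subunit has served as the reference, rather than stopping when a pass removes nothing as the Fortran does.

```python
from itertools import product


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def minimally_cross_hybridizing_set(seed: str, alphabet: str = "AGT", min_diff: int = 3):
    """Greedy set of words that pairwise differ in at least min_diff positions."""
    # First table: the seed plus every word far enough from the seed.
    table = [seed] + [
        "".join(w)
        for w in product(alphabet, repeat=len(seed))
        if mismatches(seed, "".join(w)) >= min_diff
    ]
    i = 1
    while i < len(table):
        pivot = table[i]
        # Keep everything up to the pivot; filter the rest against the pivot.
        table = table[: i + 1] + [w for w in table[i + 1:] if mismatches(pivot, w) >= min_diff]
        i += 1
    return table


if __name__ == "__main__":
    for word in minimally_cross_hybridizing_set("AAAA"):
        print(word)
```

The result depends on the seed and on the order in which candidates are enumerated, which is why the Fortran program loops over every possible initial subunit and writes one set per seed.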






APPENDIX lb 

Exemplary computer program for generating 
minimally cross hybridizing sets 
(single stranded tag/single stranded tag complement) 



      Program tagN
c
c
c     Program tagN generates minimally cross-hybridizing
c     sets of subunits given i) N, the subunit length, and ii)
c     an initial subunit sequence. tagN assumes that only
c     3 of the four natural nucleotides are used in the tags.
c
c
      character*1 sub1(20)
      integer*2 mset(10000,20), nbase(20)
c
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
  100 format(i2)
c
c
      write(*,*)'ENTER SUBUNIT SEQUENCE'
      read(*,110)(sub1(k),k=1,nsub)
  110 format(20a1)
c
c
      ndiff=10
c
c
c     Let a=1 c=2 g=3 & t=4
c
c
      do 800 kk=1,nsub
      if(sub1(kk).eq.'a') then
         mset(1,kk)=1
      endif
      if(sub1(kk).eq.'c') then
         mset(1,kk)=2
      endif
      if(sub1(kk).eq.'g') then
         mset(1,kk)=3
      endif
      if(sub1(kk).eq.'t') then
         mset(1,kk)=4
      endif
  800 continue
c
c
c     Generate set of subunits differing from
c     sub1 by at least ndiff nucleotides.
c
c     jj counts the subunits stored in mset; sub1 is entry 1.
c
      jj=1
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
      do 1000 k5=1,3
      do 1000 k6=1,3
      do 1000 k7=1,3
      do 1000 k8=1,3
      do 1000 k9=1,3
      do 1000 k10=1,3
      do 1000 k11=1,3
      do 1000 k12=1,3
      do 1000 k13=1,3
      do 1000 k14=1,3
      do 1000 k15=1,3
      do 1000 k16=1,3
      do 1000 k17=1,3
      do 1000 k18=1,3
      do 1000 k19=1,3
      do 1000 k20=1,3
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
      nbase(5)=k5
      nbase(6)=k6
      nbase(7)=k7
      nbase(8)=k8
      nbase(9)=k9
      nbase(10)=k10
      nbase(11)=k11
      nbase(12)=k12
      nbase(13)=k13
      nbase(14)=k14
      nbase(15)=k15
      nbase(16)=k16
      nbase(17)=k17
      nbase(18)=k18
      nbase(19)=k19
      nbase(20)=k20
c
c
      do 1250 nn=1,jj
c
      n=0
      do 1200 j=1,nsub
      if(mset(nn,j).eq.1 .and. nbase(j).ne.1 .or.
     1   mset(nn,j).eq.2 .and. nbase(j).ne.2 .or.
     2   mset(nn,j).eq.3 .and. nbase(j).ne.3 .or.
     3   mset(nn,j).eq.4 .and. nbase(j).ne.4) then
         n=n+1
      endif
 1200 continue
c
c
      if(n.lt.ndiff) then
         goto 1000
      endif
 1250 continue
c
c
      jj=jj+1
      write(*,130)(nbase(i),i=1,nsub),jj
      do 1100 i=1,nsub
         mset(jj,i)=nbase(i)
 1100 continue
c
 1000 continue
c
c
      write(*,*)
  130 format(10x,20(1x,i1),5x,i5)
      write(*,*)
      write(*,120) jj
  120 format(1x,'Number of words=',i5)
c
c
      end
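
Program tagN differs from minxh in that it makes a single pass: each candidate word is compared against every word accepted so far and is kept only if it differs from all of them in at least ndiff positions. A compact Python sketch of that acceptance rule follows. It is an illustration under assumptions rather than a port: it enumerates candidates over exactly the requested number of positions instead of twenty nested loops, the names `CODE`, `count_mismatches`, and `tag_n` are hypothetical, and the mismatch threshold is passed as a parameter instead of being fixed at 10.

```python
from itertools import product

# Same numeric coding as the Fortran comments: a=1, c=2, g=3, t=4
CODE = {"a": 1, "c": 2, "g": 3, "t": 4}


def count_mismatches(u, v) -> int:
    """Number of positions at which two coded words differ."""
    return sum(1 for x, y in zip(u, v) if x != y)


def tag_n(seed: str, ndiff: int):
    """Single-pass greedy selection; candidates use only codes 1-3, as in tagN."""
    accepted = [tuple(CODE[ch] for ch in seed.lower())]
    for candidate in product((1, 2, 3), repeat=len(seed)):
        if all(count_mismatches(word, candidate) >= ndiff for word in accepted):
            accepted.append(candidate)
    return accepted


if __name__ == "__main__":
    # Short words keep the enumeration small; the search space grows as 3**N.
    for word in tag_n("aaaa", 3):
        print(word)
```

Because the candidate space grows as 3^N, runs with N near 20 involve billions of candidates, which is worth keeping in mind before trying long subunits.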






APPENDIX Ic 

Exemplary computer program for generating 
minimally cross hybridizing sets 
(double stranded tag/single stranded tag complement) 

      Program 3tagN
c
c
c     Program 3tagN generates minimally cross-hybridizing
c     sets of duplex subunits given i) N, the subunit length,
c     and ii) an initial homopurine sequence.
c
c
      character*1 sub1(20)
      integer*2 mset(10000,20), nbase(20)
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
  100 format(i2)
c
c
      write(*,*)'ENTER SUBUNIT SEQUENCE a & g only'
      read(*,110)(sub1(k),k=1,nsub)
  110 format(20a1)
c
c
      ndiff=10
c
c     Let a=1 and g=2
c
      do 800 kk=1,nsub
      if(sub1(kk).eq.'a') then
         mset(1,kk)=1
      endif
      if(sub1(kk).eq.'g') then
         mset(1,kk)=2
      endif
  800 continue
c
      jj=1
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
      do 1000 k5=1,3
      do 1000 k6=1,3
      do 1000 k7=1,3
      do 1000 k8=1,3
      do 1000 k9=1,3
      do 1000 k10=1,3
      do 1000 k11=1,3
      do 1000 k12=1,3
      do 1000 k13=1,3
      do 1000 k14=1,3
      do 1000 k15=1,3
      do 1000 k16=1,3
      do 1000 k17=1,3
      do 1000 k18=1,3
      do 1000 k19=1,3
      do 1000 k20=1,3
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
      nbase(5)=k5
      nbase(6)=k6
      nbase(7)=k7
      nbase(8)=k8
      nbase(9)=k9
      nbase(10)=k10
      nbase(11)=k11
      nbase(12)=k12
      nbase(13)=k13
      nbase(14)=k14
      nbase(15)=k15
      nbase(16)=k16
      nbase(17)=k17
      nbase(18)=k18
      nbase(19)=k19
      nbase(20)=k20
c
      do 1250 nn=1,jj
c
      n=0
      do 1200 j=1,nsub
      if(mset(nn,j).eq.1 .and. nbase(j).ne.1 .or.
     1   mset(nn,j).eq.2 .and. nbase(j).ne.2 .or.
     2   mset(nn,j).eq.3 .and. nbase(j).ne.3 .or.
     3   mset(nn,j).eq.4 .and. nbase(j).ne.4) then
         n=n+1
      endif
 1200 continue
c
      if(n.lt.ndiff) then
         goto 1000
      endif
 1250 continue
c
      jj=jj+1
      write(*,130)(nbase(i),i=1,nsub),jj
      do 1100 i=1,nsub
         mset(jj,i)=nbase(i)
 1100 continue
c
 1000 continue
c
      write(*,*)
  130 format(10x,20(1x,i1),5x,i5)
      write(*,*)
      write(*,120) jj
  120 format(1x,'Number of words=',i5)
c
c
      end






SEQUENCE LISTING 

(1) GENERAL INFORMATION: 

(i) APPLICANT: David W. Martin, Jr. 

(ii) TITLE OF INVENTION: Measurement of Gene Expression Profiles in
Toxicity Determination

(iii) NUMBER OF SEQUENCES: 7 

(iv) CORRESPONDENCE ADDRESS: 

(A) ADDRESSEE: Stephen C. Macevicz, Lynx Therapeutics, Inc. 

(B) STREET: 3832 Bay Center Place 

(C) CITY: Hayward 

(D) STATE: California 

(E) COUNTRY: USA 

(F) ZIP: 94545 

(v) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: 3.5 inch diskette 

(B) COMPUTER: IBM compatible 

(C) OPERATING SYSTEM: Windows 3.1 

(D) SOFTWARE: Microsoft Word 5.1 

(vi) CURRENT APPLICATION DATA: 

(A) APPLICATION NUMBER: 

(B) FILING DATE: 

(C) CLASSIFICATION: 

(vii) PRIOR APPLICATION DATA: 

(A) APPLICATION NUMBER: PCT/US96/09513

(B) FILING DATE: 06-JUN-96 

(vii) PRIOR APPLICATION DATA: 

(A) APPLICATION NUMBER: PCT/US95/12791

(B) FILING DATE: 12-OCT-95 

(viii) ATTORNEY /AGENT INFORMATION: 

(A) NAME: Stephen C. Macevicz

(B) REGISTRATION NUMBER: 30,285 

(C) REFERENCE/DOCKET NUMBER: 813wo 

(ix) TELECOMMUNICATION INFORMATION: 

(A) TELEPHONE: (510) 670-9365 

(B) TELEFAX: (510) 670-9302 



(2) INFORMATION FOR SEQ ID NO: 1 



(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 






(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1: 



CTAGTCGACC A 11 



(2) INFORMATION FOR SEQ ID NO: 2:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 nucleotides 

(B) TYPE: nucleic acid 
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2: 



NRRGATCYNN N 11



(2) INFORMATION FOR SEQ ID NO: 3: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 38 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3: 



GAGGATGCCT TTATGGATCC ACTCGAGATC CCAATCCA 38 



(2) INFORMATION FOR SEQ ID NO: 4:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 nucleotides 
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4: 



AGTGGCTGGG CATCGGACCG 20



(2) INFORMATION FOR SEQ ID NO: 5: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 nucleotides 

(B) TYPE: nucleic acid 



(C) STRANDEDNESS: double
(D) TOPOLOGY: linear



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 



GGGGCCCAGT CAGCGTCGAT 20



(2) INFORMATION FOR SEQ ID NO: 6 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 nucleotides 
(B) TYPE: nucleic acid

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6 



ATCGACGCTG ACTGGGCCCC 20



(2) INFORMATION FOR SEQ ID NO: 7: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 62 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 7:

AAAAGGAGGA GGCCTTGATA GAGAGGACCT GTTTAAACGG ATCCTCTTCC 50
TCTTCCTCTT CC 62




I claim: 

1. A method of determining the toxicity of a compound, the method comprising
the steps of:

administering the compound to a test organism;

extracting a population of mRNA molecules from each of one or more tissues
of the test organism;

forming a separate population of cDNA molecules from each population of
mRNA molecules from the one or more tissues such that each cDNA molecule of a
separate population has an oligonucleotide tag attached, the oligonucleotide tags
being selected from the same minimally cross-hybridizing set;

separately sampling each population of cDNA molecules such that
substantially all different cDNA molecules within a separate population have different
oligonucleotide tags attached;

sorting the cDNA molecules of each separate population by specifically
hybridizing the oligonucleotide tags with their respective complements, the respective
complements being attached as uniform populations of substantially identical
complements in spatially discrete regions on one or more solid phase supports;

determining the nucleotide sequence of a portion of each of the sorted cDNA
molecules of each separate population to form a frequency distribution of expressed
genes for each of the one or more tissues; and

correlating the frequency distribution of expressed genes in each of the one or
more tissues with the toxicity of the compound.

2. The method of claim 1 wherein said oligonucleotide tag and said complement
of said oligonucleotide tag are single stranded.

3. The method of claim 2 wherein said oligonucleotide tag consists of a plurality
of subunits, each subunit consisting of an oligonucleotide of 3 to 9 nucleotides in
length and each subunit being selected from the same minimally cross-hybridizing set.

4. The method of claim 3 wherein said one or more solid phase supports are
microparticles and wherein said step of sorting said cDNA molecules onto the
microparticles produces a subpopulation of loaded microparticles and a subpopulation
of unloaded microparticles.

5. The method of claim 4 further including a step of separating said loaded
microparticles from said unloaded microparticles.






6. The method of claim 5 further including a step of repeating said steps of
sampling, sorting, and separating until a number of said loaded microparticles is
accumulated is at least 10,000.

7. The method of claim 6 wherein said number of loaded microparticles is at
least 100,000.

8. The method of claim 7 wherein said number of loaded microparticles is at
least 500,000.

9. The method of claim 5 further including a step of repeating said steps of
sampling, sorting, and separating until a number of said loaded microparticles is
accumulated is sufficient to estimate the relative abundance of a cDNA molecule
present in said population at a frequency within the range of from 0.1% to 5% with a
95% confidence limit no larger than 0.1% of said population.

10. The method of claim 4 wherein said test organism is a mammalian tissue
culture.

11. The method of claim 10 wherein said mammalian tissue culture comprises
hepatocytes.

12. The method of claim 4 wherein said test organism is an animal selected from
the group consisting of rats, mice, hamsters, guinea pigs, rabbits, cats, dogs, pigs, and
monkeys.

13. The method of claim 12 wherein said one or more tissues are selected from the
group consisting of liver, kidney, brain, cardiovascular, thyroid, spleen, adrenal, large
intestine, small intestine, pancreas, urinary bladder, stomach, ovary, testes, and
mesenteric lymph nodes.



14. A method of identifying genes which are differentially expressed in a selected
tissue of a test animal after treatment with a compound, the method comprising the
steps of:

administering the compound to a test animal;

extracting a population of mRNA molecules from the selected tissue of the
test animal;

forming a population of cDNA molecules from the population of mRNA
molecules such that each cDNA molecule has an oligonucleotide tag attached, the
oligonucleotide tags being selected from the same minimally cross-hybridizing set;

sampling the population of cDNA molecules such that substantially all
different cDNA molecules have different oligonucleotide tags attached;

sorting the cDNA molecules by specifically hybridizing the oligonucleotide
tags with their respective complements, the respective complements being attached as
uniform populations of substantially identical complements in spatially discrete
regions on one or more solid phase supports;

determining the nucleotide sequence of a portion of each of the sorted cDNA
molecules to form a frequency distribution of expressed genes; and

identifying genes expressed in response to administering the compound by
comparing the frequency distribution of expressed genes of the selected tissue of the
test animal with a frequency distribution of expressed genes of the selected tissue of a
control animal.

15. The method of claim 14 wherein said oligonucleotide tag and said
complement of said oligonucleotide tag are single stranded.

16. The method of claim 15 wherein said oligonucleotide tag consists of a
plurality of subunits, each subunit consisting of an oligonucleotide of 3 to 9
nucleotides in length and each subunit being selected from the same minimally
cross-hybridizing set.

17. The method of claim 16 wherein said one or more solid phase supports are
microparticles and wherein said step of sorting said cDNA molecules onto the
microparticles produces a subpopulation of loaded microparticles and a subpopulation
of unloaded microparticles.

18. The method of claim 17 further including a step of separating said loaded
microparticles from said unloaded microparticles.

19. The method of claim 18 further including a step of repeating said steps of
sampling, sorting, and separating until a number of said loaded microparticles is
accumulated is at least 10,000.






20. The method of claim 19 wherein said number of loaded microparticles is at
least 100,000.

21. The method of claim 20 wherein said number of loaded microparticles is at
least 500,000.

22. The method of claim 18 further including a step of repeating said steps of
sampling, sorting, and separating until a number of said loaded microparticles is
accumulated is sufficient to estimate the relative abundance of a cDNA molecule
present in said population at a frequency within the range of from 0.1% to 5% with a
95% confidence limit no larger than 0.1% of said population.

23. The method of claim 17 wherein said test animal is selected from the group
consisting of rats, mice, hamsters, guinea pigs, rabbits, cats, dogs, pigs, and monkeys.

24. The method of claim 23 wherein said selected tissue is selected from the
group consisting of liver, kidney, brain, cardiovascular, thyroid, spleen, adrenal, large
intestine, small intestine, pancreas, urinary bladder, stomach, ovary, testes, and
mesenteric lymph nodes.



25. A use of the technique of massively parallel signature sequencing to determine
the toxicity of a compound in a test organism, the use comprising the steps of:

administering the compound to a test organism;

extracting a population of mRNA molecules from each of one or more tissues
of the test organism and forming a population of cDNA molecules for each of the one
or more tissues;

determining the nucleotide sequence of a portion of each of the cDNA
molecules of each separate population using massively parallel signature sequencing
to form a frequency distribution of expressed genes for each of the one or more
tissues; and

correlating the frequency distribution of expressed genes in each of the one or
more tissues with the toxicity of the compound.

26. The use of claim 25 wherein said test organism is a mammalian tissue culture.

27. The use of claim 26 wherein said mammalian tissue culture comprises
hepatocytes.






28. The use of claim 25 wherein said test organism is an animal selected from the
group consisting of rats, mice, hamsters, guinea pigs, rabbits, cats, dogs, pigs, and
monkeys.

29. The use of claim 28 wherein said one or more tissues are selected from the
group consisting of liver, kidney, brain, cardiovascular, thyroid, spleen, adrenal, large
intestine, small intestine, pancreas, urinary bladder, stomach, ovary, testes, and
mesenteric lymph nodes.

30. A use of the technique of massively parallel signature sequencing to identify
genes which are differentially expressed in a test organism after treatment with a
compound and which are correlated with toxicity of the compound, the use
comprising the steps of:

administering the compound to the test organism;

extracting a population of mRNA molecules from a selected tissue of the test
organism and forming a population of cDNA molecules;

determining the nucleotide sequence of a portion of each of the cDNA
molecules using massively parallel signature sequencing to form a frequency
distribution of expressed genes;

identifying genes expressed in response to administering the compound by
comparing the frequency distribution of expressed genes of the selected tissue of the
test organism with a frequency distribution of expressed genes of the selected tissue
of a control organism; and

determining whether the genes expressed in response to administering the
compound are correlated with toxicity of the compound in the test organism.
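
The sampling criterion recited in claims 9 and 22 can be given a rough numerical feel. The sketch below is one possible reading, stated as an assumption and not as the meaning of the claims: the "95% confidence limit no larger than 0.1%" is treated as the half-width of a two-sided normal-approximation interval for a binomial proportion, which gives a required count of about z^2 p(1-p)/E^2. The function name and the constant `Z_95` are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def required_sample_size(frequency: float, half_width: float, z: float = Z_95) -> int:
    """Sequences needed so that the normal-approximation interval for a proportion
    has the requested half-width (assumed interpretation)."""
    return math.ceil(z * z * frequency * (1.0 - frequency) / half_width ** 2)


if __name__ == "__main__":
    # Frequencies spanning the 0.1% to 5% range named in the claims
    for frequency in (0.001, 0.01, 0.05):
        n = required_sample_size(frequency, half_width=0.001)
        print(f"frequency={frequency:.3f}  required loaded microparticles ~ {n}")
```

Under this reading the required count rises with the frequency being estimated, from a few thousand at 0.1% to well over one hundred thousand at 5%, the same orders of magnitude as the thresholds recited in claims 6-8 and 19-21.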






[Sheet 1/2. Flow chart with reference numerals 100, 110, 120, 125, 130, 150. Boxes:
Generate Table Mn of all possible subunits of desired length and composition;
Select initial subunit Si (i=1); Compare subunit Si to successive subunits in
Table Mn from Si+1 to end of Table; decision "Does subunit meet criteria?" with
Yes and No branches; Save subunit in Table Mn+1; Discard subunit, go to next
subunit; Replace Mn with Mn+1.]



[Sheet 2/2. Fig. 2: apparatus diagram with reference numerals 304, 312, 318, 330,
332, 334, 350, 352, 354; labeled elements include COMPUTER and LIGHT SOURCE.]



INTERNATIONAL SEARCH REPORT 



Intemfttioiul Rf^Ucadon No. 
PCT/US96/16342 



A. CLASSinCATlON OF SUBJECT MATTER 

IPC(6) : C12Q 1/68 ; C07H 21/04 

US CL : 435/6; 536/24.3 
According to International Patent Claasificatton (IPC) or to both national claAiification and IPC 

B- HELPS SEARCHED 

Minimum documentation vearcbed (claatification lystem followed by classification symbols) 

U.S. : 435/6; 536/24.3 



Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched 



Electronic data base consulted during the international search (name of data base and, where practicable, search terms used)
APS, MEDLINE, BIOSIS, CAPLUS, SCISEARCH

search terms: Martin, David W., toxic?, differential?, express?, cDNA, mRNA, RNA, gene#, hybrid?.



C. DOCUMENTS CONSIDERED TO BE RELEVANT



Category*   Citation of document, with indication, where appropriate, of the relevant passages   Relevant to claim No.

CHETVERIN et al. Oligonucleotide arrays: New concepts and
possibilities. Bio/Technology. 12 November 1994, Vol. 12,
pages 1093-1099, especially pages 1095-1096.
Relevant to claim No.: 1-30

BRENNER et al. Encoded combinatorial chemistry.
Proceedings of the National Academy of Sciences USA. June
1992, Vol. 89, pages 5381-5383.
Relevant to claim No.: 1-30

MATSUBARA et al. cDNA analyses in the human genome
project. Gene. 15 December 1993, Vol. 135, No. 1-2, pages
265-274.
Relevant to claim No.: 1-30



[X] Further documents are listed in the continuation of Box C.    [ ] See patent family annex.

* Special categories of cited documents:

"A" document defining the general state of the art which is not considered
to be of particular relevance

"E" earlier document published on or after the international filing date

"L" document which may throw doubts on priority claim(s) or which is
cited to establish the publication date of another citation or other
special reason (as specified)

"O" document referring to an oral disclosure, use, exhibition or other means

"P" document published prior to the international filing date but later than
the priority date claimed

"T" later document published after the international filing date or priority
date and not in conflict with the application but cited to understand the
principle or theory underlying the invention

"X" document of particular relevance; the claimed invention cannot be
considered novel or cannot be considered to involve an inventive step
when the document is taken alone

"Y" document of particular relevance; the claimed invention cannot be
considered to involve an inventive step when the document is
combined with one or more other such documents, such combination
being obvious to a person skilled in the art

"&" document member of the same patent family



Date of the actual completion of the international search
27 JANUARY 1997

Date of mailing of the international search report
19 FEB 1997

Name and mailing address of the ISA/US
Commissioner of Patents and Trademarks
Box PCT
Washington, D.C. 20231
Facsimile No. (703) 305-3230

Authorized officer
SCOTT D. PRIEBE

Telephone No. (703) 308-0196

Form PCT/ISA/210 (second sheet)(July 1992)*



INTERNATIONAL SEARCH REPORT 



International application No.
PCT/US96/16342 



C (Continuation). DOCUMENTS CONSIDERED TO BE RELEVANT

Category*   Citation of document, with indication, where appropriate, of the relevant passages   Relevant to claim No.

WO 95/21944 A1 (SMITHKLINE BEECHAM CORPORATION)
17 August 1995, page 4, lines M, page 5, lines 31-37, page 17,
lines 15-27, page 18, lines 30-35, page 20, line 23 to page 21,
line 4.
Relevant to claim No.: 1-30

Form PCT/ISA/210 (continuation of second sheet)(July 1992)*



Reference 8 of 20 

with Response dated 05/04/04 

In USSN: 09/857,826



FOCUS - 17 of 19 DOCUMENTS

Copyright 1997 PR Newswire Association, Inc.

PR Newswire

August 11, 1997, Monday

SECTION: Financial News

DISTRIBUTION: TO BUSINESS AND MEDICAL EDITORS
LENGTH: 478 words

HEADLINE: Eli Lilly & Co. and Acacia Biosciences Enter Into Research Collaboration;
First Corporate Agreement for Acacia's Genome Reporter Matrix(TM)

DATELINE: RICHMOND, Calif., Aug. 11
BODY:

Acacia Biosciences and Eli Lilly and Company (Lilly) announced today the signing of a joint research collaboration
to utilize Acacia's Genome Reporter Matrix(TM) (GRM) to aid in the selection and optimization of lead compounds.
Under the collaboration, Acacia will provide chemical and biological profiles on a class of Lilly's compounds for an
undisclosed fee.

Acacia's GRM is an assay-based computer modeling system that uses yeast as a miniature ecosystem. The GRM
can profile the extent, nature and quantity of any changes in gene expression. Because of the similarities between
the yeast and human genome, the system serves as an excellent surrogate for the human body, mimicking the effects
induced by a biologically active molecule.

"Using yeast as a model organism for lead optimization makes a lot of sense given the high degree of homology with
human metabolic pathways," said William Current of Lilly Research Laboratories. "Acacia's innovative GRM has
the potential to provide enormous insight into the therapeutic impact of our compounds and make the drug discovery
process more rational. It should substantially accelerate the development process."

"This first agreement with a major pharmaceutical company is an important milestone in the development of
Acacia," said Bruce Cohen, President and CEO of Acacia. "The deal is in line with our strategy of establishing
alliances that will allow our collaborators to use genomic profiles to identify and optimize compounds within
their existing portfolios. In the long run, this technology can be used to characterize large scale combinatorial
libraries, predict side effects prior to clinical trials and resurrect drugs that have failed during clinical trials."

The GRM incorporates two critical elements: chemical response profiles and genetic response profiles. The
chemical response profiles measure the change in gene expression caused by potential therapeutics and then rank genes
with altered expressions by degree of response. The genetic response profiles measure changes in gene expression
caused by mutations in the genes encoding potential targets of pharmaceuticals; these genetic response profiles represent
gold standards in drug discovery by defining the response profile expected for drugs with perfect selectivity and
specificity. By comparing the two profiles, one can analyze a potential drug candidate's ability to mimic the action of
a 'perfect' drug.
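
The profile comparison described above can be pictured with a small sketch. The release does not say how the two kinds of profile are compared, so Pearson correlation is used here purely as an assumed similarity measure; the function names, the mutant labels and the numbers are hypothetical toy values, not Acacia data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression-change vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no pattern to match
    return cov / (sd_x * sd_y)


def rank_targets(chemical_profile, genetic_profiles):
    """Rank genetic response profiles by similarity to one chemical response profile."""
    scores = {name: pearson(chemical_profile, profile)
              for name, profile in genetic_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy expression changes across four reporter genes (hypothetical values).
compound = [2.0, -1.0, 0.5, 0.0]
mutants = {
    "mutant_A": [4.0, -2.0, 1.0, 0.0],    # same pattern as the compound
    "mutant_B": [-2.0, 1.0, -0.5, 0.0],   # opposite pattern
}
ranking = rank_targets(compound, mutants)
```

In this toy case the mutant whose profile has the same shape as the compound's ranks first, which is the sense in which a compound might be said to mimic a 'perfect' drug against that target.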

Acacia Biosciences is a functional genomics company developing proprietary technologies to enhance the speed
and efficacy of drug discovery and development. Acacia's Genome Reporter Matrix capitalizes on the latest advances
in genomics and combinatorial chemistry to generate comprehensive profiles of drug candidates' in vivo activity.
SOURCE Acacia Biosciences

CONTACT: Bruce Cohen, President and CEO of Acacia Biosciences, 510-669-2330 ext. 103 or Media: Linda
Seaton of Feinstein

LOAD-DATE: August 12, 1997



Reference 9 of 20 

with Response dated 05/04/04

In USSN: 09/857,826



Bioreactor Market: Steady Growth Expected

[Front-page teaser largely illegible in the scan; legible fragment: market for bioreactors valued at $375 million.]





GENETIC ENGINEERING NEWS

00/25/97

BIOTEC[...] • BIOPROCESS • BIORESEARCH • TECHNOLOGY TRANSFER



Pharmagene 
Raises More 
Capital for 
Research on 
Human 
Tissues 

By Sophia Foi 

Pharmagene, the Royston, U.K.-based biopharmaceutical company specialising in
the use of human biomaterials for drug discovery research, has raised a
further £5 million from a group of investors led by 3i and Abacus
Nominees. The funding will enable the company to expand both its
human biomaterials collection and its capabilities across a range of
proprietary platform technologies.

Gordon Baxter, Ph.D., Pharmagene's cofounder and chief operating officer,
claimed "by the end of this year Pharmagene will have access to the largest
collection of human RNAs and proteins anywhere in the world, and a range of
innovative, yet robust technologies
SEE PHARMAGENE, P. 0



Perkin-Elmer Acquires PerSeptive to Expand
Its Capabilities in Gene-Based Drug Discovery

By John Sterling

Perkin-Elmer's (PE; Norwalk, CT) decision last month to acquire PerSeptive
Biosystems (Framingham, MA) via a $360 million stock swap was designed to
strengthen PE in terms of broad capabilities in gene-based drug discovery.
The company's main goal is to develop new products to improve the
integration of genetic and protein research.

"This merger will enhance our position as an effective provider of
innovative, integrated platforms enabling our customers to be more
efficient and cost-effective in bringing new pharmaceuticals to market,"
says Tony L. White, PE's chairman, president and CEO. "The combination of
our two companies should bolster our presence in the life sciences, [and it
is our] belief that we must take bold action now to lead the emerging era of
molecular medicine with leading positions in both genetic and protein
analysis."

A driving force behind the merger is the vast amount of genetic information
about human disease that is being accumulated by researchers and biotech
companies working in the area of genomics. It is becoming increasingly
obvious that these data need to be complemented with technologies for
studying proteins and protein networks, a field known as proteomics
(see GEN, September 1, 1997, p. 1).

PE officials, who claim that MALDI-TOF (Matrix Assisted
SEE ACQUISITION, P. 10

[Photo caption: Perkin-Elmer acquired PerSeptive Biosystems for $360 million
to obtain new technologies in mass spectrometry, bioseparations and
purification, for product development projects, spanning the range from
genomics to proteomics.]



FDA OKs Genzyme's Carticel
Product for Damage to Knees



Strategies for Target Validation
Streamline Evaluation of Leads



[Figure: Carticel process diagram. Legible labels: Defect; Periosteal flap; Biopsy; Genzyme Tissue Repair; Cell Processing; Carticel.]

Carticel, which was approved for the repair of clinically significant, symptomatic
cartilaginous defects of the femoral condyle (medial, lateral or
trochlear) caused by acute or repetitive trauma, employs a proprietary
process to grow autologous cartilage cells for implantation.



By Naomi Pfeiffer

The FDA has approved a knee-cartilage replacement product made by Genzyme
Tissue Repair (Cambridge, MA), a tracking-stock division of Genzyme
Corp., for people with trauma-damaged knees.

Carticel (autologous cultured chondrocytes) is the first product to
be licensed under the FDA's pro-

SEE GENZYME, P. [illegible]



By Vicki Glaser

Acacia Biosciences (Richmond, CA) last month announced its first agreement
with a major pharmaceutical company, signing a deal with Eli Lilly
(Indianapolis, IN) to use Acacia's Genome Reporter Matrix (GRM) to select
and optimize some of Lilly's lead compounds. Acacia's yeast-based system for
profiling drug activity is useful for evaluating the therapeutic potential
of lead compounds, and it also has a role in the identification and
validation of new drug targets.

"We are using the ecosystem of a cell to allow us to deduce the mechanism of
action and target for any chemical," explains Bruce Cohen, president and
CEO. "We screen for every target in a cell simultaneously...using
transcription as a readout for how a cell is adapting to any perturbation,"
he says.

The GRM technology consists of two main databases: one is the genetic
response profile, showing the effects of mutations in each individual yeast
gene and compensatory gene regulatory mechanisms; the other is the chemical
response profile, which documents changes in gene expression in response to
chemical compounds. Computational analysis and pattern matching between the
genetic and chemical profiles yields information on the specificity,
potency and side-effects risk of a drug lead.

Tackling Targets

No longer is mapping and sequencing a gene, or the human genome, an end
unto itself, but
SEE TARGET, P. 16



Sticky Ends 

Avigen received two grants from the NIH & University of California for
research on gene therapy for treatment of cancer & HIV infections...
MRL Pharmaceutical Services, of Reston, VA, launched the TSK Bug Finder,
which is able to locate & retrieve client-specified microorganisms in
real-time... Gensia Sicor, Inc. will move its corporate staff from
San Diego to Irvine, CA, by end of year...

FDA accepted NDA from Sepracor for levalbuterol HCl inhalation solution...
An $11.7M mezzanine financing has been closed by Activated Cell Therapy,
which changed its name to Dendreon Corporation... Astra AB will build major
research facility in Waltham, MA, and is also relocating Astra Arcus
research facility from Rochester to Boston area... Prolifix Ltd. team used
a small peptide to inhibit the E2F protein complex and induced

apoptosis in mammalian tumor cells... Vertex Pharmaceuticals, Inc. and
Alpha Therapeutic Corp. ended an agreement to develop [compound name
illegible] for treatment of inherited hemoglobin disorders... NaviCyte
received Phase 1 SBIR grant for up to $100,000 from NIH for development of
prototype of its NaviFlow technology for high-throughput screening...
Covance Inc. will invest $21 million in expansion and renovation of its
facility in Indianapolis, IN.



PUBLISHED BY [name illegible], NEW YORK



Target 



merely a means to an end. The critical next step is to validate the gene
and its protein product as a potential drug target. The Human Genome
Project continues to produce a treasure chest of expressed sequence
tags (ESTs) and a tantalizing array of complete gene sequences.

Companies are applying a variety of functional genomic strategies to
link genes to specific diseases and to multigenic phenotypes. Yet the
ultimate challenge for pharmaceutical companies is to sift through all the
sequence and differential gene expression data to identify the best
targets for drug discovery.

Spinning off technology developed at the University of North
Carolina (Chapel Hill), Cytogen Corp. (Princeton, NJ) formed its
wholly owned subsidiary AxCell Biosciences earlier this year. The
young company is building a protein interaction database, cataloging all
the interactions the modular domains of proteins can engage in with a
range of ligands, in order to gain insight into protein function and to
select the most critical interaction to target for drug development.

AxCell's cloning-of-ligand-targets (COLT) technology employs "recognition
units" from the company's genetic diversity library (GDL) to
map functional protein interactions and quantitate their affinity. The
company's inter-functional proteomic database (IFP-dbase) elucidates
protein interaction networks and structure-activity relationships based
on ligand affinity with protein modular domains.

Defining Disease Pathways

Signal Pharmaceuticals, Inc.'s (San Diego, CA) integrated drug target and
discovery effort is based on mapping gene-regulating pathways in
cells and identifying small molecules that regulate the activation of those
genes. In collaboration with academic researchers, the company has
identified a large number of regulatory proteins in several
mitogen-activated protein (MAP) kinase pathways (including the JNK, FRK and p38



[Figure caption: The Genome Reporter Matrix depicts a subset of a yeast array.
Each colony harbors a GFP reporter [...]. Collectively, the array reports
the expression of almost [...]. A: Array in visible light. B: Image of
fluorescent emission from the array. Acacia Biosciences]



signaling pathways), which Signal is evaluating for the treatment of
autoimmune, inflammatory, cardiovascular and neurologic diseases, and
cancer. Other target identification programs focus on the NF-kB pathway,
estrogen-related genes and central/peripheral nervous system genes.

Regulating cytokine production in immune and inflammatory disorders



A strong chemical combination to help you grow. And flourish.

Three hundred million dollars and ten years of hard work. That's what it costs to bring your biotechnology-
derived therapeutic to the marketplace.
Which means, no room for error.

Which means, in turn, you'd be wise to tap into the combined capabilities of Mallinckrodt and J.T.Baker:
dual sources, trusted names for your chemical raw materials.

Two separate GMP-produced brands offering the control of a single quality system and the convenience of a
single audit process.

We offer comprehensive product lines including USP salts, bioreagents, high purity solvents and
chromatography products in Beaker to Bulk packaging for easy scale-up.

Call l-80a-S«2-3S37, or access our website at IntpyAw^.mallhakcrcom. For dual chemical sources dedicated
to helping you grow. Flourish. Succeed!

MALLINCKRODT J.T.Baker



and modifying bone metabolism to treat osteoporosis are the focus of
Signal's collaboration with Tanabe Seiyaku (Osaka, Japan). Signal has
partnered with Organon/Akzo Nobel (Netherlands) to identify
estrogen-responsive genes as targets for treating neurodegenerative and
psychiatric diseases, atherosclerosis and ischemia, and with Roche
Bioscience (Palo Alto, CA) to develop human peripheral nerve cell lines
for the discovery of treatments for pain and incontinence.

Exelixis' (S. San Francisco, CA) strategy for target selection is to
define disease pathways and identify regulatory molecules that activate or
inhibit those biochemical/genetic pathways. Based on the finding that
these pathways are conserved across species, the company is studying the
model genetic systems of Drosophila and Caenorhabditis elegans. Using
its PathFinder technology, Exelixis systematically introduces mutations
into the genomes of these model organisms, looking for mutations
that enhance or suppress the target disease-related gene. These novel
genes then become the basis of drug screening assays.

Cadus Pharmaceutical Corp. (Tarrytown, NY) is identifying surrogate
ligands to newly discovered orphan G-protein coupled transmembrane
receptors of unknown function to determine the suitability of the
receptors as drug targets. Inserting the novel receptor in a yeast system
yields a ligand that activates the receptor. Access to a surrogate ligand
allows the company to screen for receptor antagonists in the yeast system.

"The antagonist plus the surrogate ligand gives you two probes, an on
probe and an off probe, which allows you to look at function," explains
David Webb, Ph.D., vp of research and chief scientific officer. A
surrogate ligand also provides information on which G-protein interacts
with the orphan receptor and its associated signaling pathways, further
clarifying the role of the receptor as a potential drug target. Cadus'
collaboration with SmithKline (Philadelphia) capitalizes on Cadus' ability
to determine orphan receptor function, applying the technology to
SmithKline's proprietary, newly discovered G-protein receptors.

Cadus' recombinant yeast system can also be used to screen cell and
tissue extract for natural ligands, and the company is accelerating its
internal drug-discovery efforts in the areas of cancer, inflammation and
allergy. A recent equity investment in Axiom Biotechnologies (San Diego,
CA) gave Cadus a license to Axiom's high-throughput pharmacologic
screening system for lead optimization and discovery.

As its name implies, Gene/Networks (Alameda, CA) focuses on identifying
gene networks that contribute to multigenic phenotypes and complex disease
processes. The integration of mouse and human genetic studies forms the
basis of the technology. The Genome Tagged Mice database in development
will serve as a library of natural mouse genetic and phenotypic
variation. Disease-related genes identified in mice are then evaluated
in human family- and population-based studies to confirm their clinical
relevance and linkages to pathophysiologic traits.

Blocking Gene Expression

Inactivating a gene known to be expressed in association with a
particular disease is one approach to identifying appropriate therapeutic
targets. The target validation and discovery program at Ribozyme
Pharmaceuticals, Inc. (Boulder, CO) applies the company's ribozyme
technology to achieve selective inhibition of gene expression in cell
culture and in animals.

Correlation of the gene expression inhibition with phenotype can
SEE TARGET, P. 38



Circle No. 61 on Reader Service Card



SEPTEMBER 15, 1997 GENETIC ENGINEERING NEWS




AxCell Biosciences scientists say their technology enables the rapid and
simple functional identification of the two essential molecular components
of protein interaction networks: specific recognition units that bind distinct
modular protein domains are identified and isolated using a combination
structural/functional approach that uses both peptide phage display Genetic
Diversity Libraries (GDL) and bioinformatics, and Cloning of Ligand
Targets (COLT) technology utilizes recognition units as functional probes to
isolate families of interactor proteins.



Target

from page 15

suggest the relative importance of the gene in disease pathology. The
company's nuclease-resistant ribozymes form the basis of a collaboration
with Schering AG (Germany) for drug target validation and the development
of ribozyme-based therapeutic agents, and with Chiron Corp. (Emeryville,
CA) for target validation.

With several antisense compounds now progressing through clinical trials,
the concept of using oligonucleotides to inhibit gene activity is not new.
But rather than focusing on therapeutics development, Sequitur, Inc.
(Natick, MA) is creating antisense compounds for the purpose of
determining gene function and validating drug targets. Clients typically
provide the one-year-old company with the sequence (or [illegible]) of a
potential gene target and, in return, Sequitur custom designs a series of
three to six antisense compounds that yield a three-to-ten-fold inhibition
of the target gene in cell culture. The company also provides oligofectins,
a series of cationic lipids, to deliver the oligonucleotides to a variety
of cultured cells.

"Differential expression information is just for correlation, it doesn't
tell function or confirm what would be a good target," says Tod Woolf,
Ph.D., director of technology development at Sequitur. Whereas, antisense
compounds will inhibit a target. Sequitur offers both phosphorothioate DNA
antisense compounds and its proprietary Next Generation chimeric
oligonucleotides, which have a higher hybridization affinity, greater
specificity and reduced toxicity, according to the company.

Mining Pathogen Genomes

Companies such as Human Genome Sciences (HGS; Rockville, MD), Incyte (Palo
Alto, CA), Millennium Pharmaceuticals Inc. (Cambridge, MA) and Genome
Therapeutics (Waltham, MA) are relying on high-speed DNA sequencing,
positional cloning and other strategies to identify specific microbial
genomic sites that would be good targets for infectious disease
therapeutics.

HGS recently completed sequencing of the bacterial pathogen
Streptococcus pneumoniae, which is the focus of an agreement with
Hoffmann-La Roche (Basel, Switzerland). Roche will use the sequence data
to develop new anti-infectives against S. pneumoniae. HGS and Roche have
expanded their collaboration to include a nonexclusive license to access
sequence information for the intestinal bacterium Enterococcus faecalis.

Incyte Pharmaceuticals has completed one-fold coverage of the
Candida albicans genome, identifying 60% of the genes of this fungal
pathogen. This genome will become part of the company's PathoSeq
microbial database. Incyte recently introduced the ZooSeq animal gene
sequence and expression database. The database will provide genomic
information across various species commonly used in preclinical drug
testing, which may help to better define potential drug targets.

Millennium Pharmaceuticals continues to report success in identifying
novel drug targets, having recently discovered a novel chemokine called
neurotactin and a new class of MAD-related proteins that inhibit
transforming growth factor beta (TGF-β) signaling. The company also
received U.S. patent coverage for the tub genes, believed to play a role in
obesity, and for the gene that encodes the protein melastatin, which
appears to suppress metastasis in malignant melanoma.



Pangea

from page [illegible]

Smith, now a computer programmer, is an expert in systems integration,
Internet technologies and the application of industrial engineering
principles to the drug discovery process. Before co-founding Pangea,
he was the manager of software development at Attorney's Briefcase,
a legal research software company.

By being "in the trenches" with customers and collaborators,
Bellenson and Smith sensed the frustration of pharmaceutical
researchers whose incompatible tools have impeded their progress.
According to Bellenson, "Most of them are geared toward analyzing
one molecule at a time. It's like emptying the ocean with an eye dropper,
an incompatible eye dropper at that. A pharmaceutical company
may have 30 different drug discovery teams with various approaches.
The problem is to manage the process of experimenting with a lot
of different approaches, to automate while maintaining flexibility."

GeneWorld 2.1 enables "integration of the entire target discovery and
validation process," Bellenson says. The commercial software package
coordinates the entire process of sequence-data analysis and can be
integrated with other programs and databases, according to Smith, who
adds that it handles thousands of sequence results, organizes and
automates annotation and seamlessly interacts with growing genome
databases. Simple forms and menus enable users to turn raw sequence
data into crucial knowledge for drug discovery by applying algorithms to
sequences, creating custom analysis strategies and producing useful
reports, without the need for writing computer code. GeneWorld 2.1 runs
on a variety of platforms and operating systems.

Pairing industrial relational database-management systems with a
web-browser interface, Pangea's Operating System of Drug Discovery
is an open computing framework that allows client/server and
Java-enabled web-based technologies to collect, organize and analyze
drug discovery information for pharmaceutical companies to simplify and
accelerate drug discovery. The technology unites automated genomics
database analysis for drug target site selection, chemical information
database analysis and large-scale combinatorial chemistry project
management and high-throughput screening project management
for drug lead efficacy analysis. Pangea officials maintain that these
integrated elements provide a unified environment for chemists,
biologists and others involved in the drug discovery process to work
together with


Europe 

GTAC Chairman, Professor Norman C. Nevin, said 1996 saw
"four important developments": an increase in enquiries and
submissions made to GTAC; an increase in the complexity of submitted
protocols; a continuing shift from gene therapy for single-gene disorders
toward strategies aimed at tumour destruction in cancer; and a growth
in international sponsorship of U.K. gene therapy trials.

Since 1993, GTAC and its predecessor, the Clothier Committee, have
approved 18 U.K. gene therapy clinical trials (13 of which have been
carried out), which are listed in the report. The disease areas targeted by
these trials include severe combined immunodeficiency (1 trial), cystic
fibrosis (6), metastatic melanoma (2), lymphoma (2), neuroblastoma (1),
breast cancer (1), Hurler's syndrome (1), cervical cancer (1), glioblastoma



[Figure: analysis workflow. Legible steps, in order: "SUBMIT SET OF 10,000 SEQUENCES"; "Remove vector"; "Mask repetitive elements"; "Mask ambiguous regions"; "Create new 'Cytokine' set from hits with keyword = 'Cytokine'"; "Multiple sequence alignment"; "Identify conserved domains".]

[Figure caption: Bioinformaticists can design and save Strategies, such as the one
shown here, that [...] through multiple-step analyses carefully and automatically.
Researchers throughout your organization can apply the same
Strategies to their own data.]
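
The idea of a saved, reusable multi-step "Strategy" can be pictured with a short sketch. This is not Pangea's software or API: every function below is a hypothetical toy stand-in, only the first three preprocessing steps of the figure are modelled, and the keyword search, alignment and domain steps are left out.

```python
import re


def remove_vector(seq, vector):
    """Strip a known vector sequence from either end (toy exact-match version)."""
    if vector and seq.startswith(vector):
        seq = seq[len(vector):]
    if vector and seq.endswith(vector):
        seq = seq[:-len(vector)]
    return seq


def mask_repeats(seq, min_run=6):
    """Mask homopolymer runs of at least min_run bases with N (stand-in for repeat masking)."""
    pattern = r"([ACGT])\1{%d,}" % (min_run - 1)
    return re.sub(pattern, lambda m: "N" * len(m.group(0)), seq)


def mask_ambiguous(seq):
    """Replace any character other than A, C, G, T or N with N."""
    return re.sub(r"[^ACGTN]", "N", seq)


def run_strategy(sequences, steps):
    """Apply each step, in order, to every sequence in the submitted set."""
    results = []
    for seq in sequences:
        for step in steps:
            seq = step(seq)
        results.append(seq)
    return results


# A "Strategy" here is just an ordered list of steps that can be reused on other data.
strategy = [lambda s: remove_vector(s, "GGCC"), mask_repeats, mask_ambiguous]
cleaned = run_strategy(["GGCC" + "ACGT" + "A" * 8 + "CGT", "ACGRTT"], strategy)
```

Keeping the strategy as plain data (an ordered list of callables) is one simple way to let different users apply the same analysis to their own sequence sets.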



commercial and public domain software.

Pangea's Operating System of Drug Discovery can accommodate
Sybase, Oracle or Informix relational database-management systems
and any version of UNIX. It absorbs new data formats, databases,
algorithms and analysis paradigms into the automated workflow without
software modifications. Netscape Navigator provides a friendly user
interface from PC, Macintosh, and UNIX workstations.

In the near term, Pangea plans to complete its bioinformatics core
with two more programs. Gene Foundry, a sample tracking and
workflow sequence package for DNA sequence and fragment information,
will also offer interaction with robots, reagent tracking and
troubleshooting. Gene Thesaurus, the other package, is a "warehouse
of bioinformatics data," says Bellenson.



breast cancer, breast cancer with liver
metastases, glioblastoma, malignant
ascites due to gastrointestinal cancer
and ovarian cancer.

Copies of the GTAC third annual
report are available from the GTAC
Secretariat, Wellington House, 133-155
Waterloo Road, London SE1 8UG, UK.



Coated Lenses Prevent PCO 



Scientists in the U.K. say it may be possible to prevent posterior capsule
opacification (PCO), a common complication following cataract
surgery, by using the implanted polymethylmethacrylate (PMMA)
intraocular lens as a drug delivery system. PCO occurs in 30-50% of
cataract surgery patients as a result of stimulated cell growth within the
remaining capsular bag. The condition causes a decline in visual acuity
and requires expensive laser treatment, thus negating the routine use of
cataract surgery in underdeveloped countries, explains G. Duncan, at the



HIGH SPECIFIC ACTIVITY 
MICROBIAL ALKALINE 
PHOSPHATASE 

from Biocatalysts 

Biocatalysts Limited, the British speciality enzyme
company, has developed a completely new type of
alkaline phosphatase with many advantages over the
types most commonly used.

It is of microbial origin with a high specific activity
(unlike that from E. coli) and with higher temperature and
storage stability compared to that from calf intestine.

This is the first of several new generation diagnostic
enzymes being developed by Biocatalysts Limited with
greatly improved stability.

* Non-animal source, no risk of BSE or animal
virus contamination

* Higher temperature stability than calf intestine

* Much higher specific activity than from E. coli

* Very high storage stability even in the absence
of glycerol

For further details on alkaline phosphatase and our other
diagnostic enzymes contact us direct at the address below or
within North America contact our US Distributor Kaltron-Pettibone,
phone: 630 350 1116 or fax: 630-350-1606

Biocatalysts Limited

Treforest Industrial Estate, Pontypridd, Wales, UK CF37 5UD
Tel: +44 (0)1443 843712 Fax: +44 (0)1443 841214
e-full-l8lly@Blocatalyitsxoa.



Reference 10 of 20 

with Response dated 05/04/04 

In USSN: 09/857,826 



Fischer-Vize, Science 270, 1828 (1995).

35. T. C. James and S. C. Elgin, Mol. Cell. Biol. 6, 3862
(1986); R. Paro and D. S. Hogness, Proc. Natl. Acad.
Sci. U.S.A. 88, 263 (1991); B. Tschiersch et al.,
EMBO J. 13, 3822 (1994); M. T. Madireddi et al., Cell
87, 75 (1996); D. G. Stokes, K. D. Tartof, R. P. Perry,
Proc. Natl. Acad. Sci. U.S.A. 93, 7137 (1996).

36. P. M. Palosaari et al., J. Biol. Chem. 266, 10750
(1991); A. Schmitz, K. H. Gartemann, J. Fiedler, E.
Grund, R. Eichenlaub, Appl. Environ. Microbiol. 58,
4068 (1992); V. Sharma, K. Suvarna, R. Meganathan,
M. E. Hudspeth, J. Bacteriol. 174, 5057
(1992); M. Kanazawa et al., Enzyme Protein 47, 9
(1993); Z. L. Boynton, G. Bennett, F. B. Rudolph,
J. Bacteriol. 178, 3015 (1996).

37. M. Ho et al., Cell 77, 869 (1994).

38. W. Hendriks et al., J. Cell. Biochem. 59, 418 (1995).

39. We thank H. Skaletsky and F. Lewitter for help with
sequence analysis; Lawrence Livermore National
Laboratory for the flow-sorted Y cosmid library; and
P. Bain, A. Bortvin, A. de la Chapelle, G. Fink, K.
Jegalian, T. Kawaguchi, E. Lander, H. Lodish, P.
Matsudaira, D. Menke, U. RajBhandary, R. Reijo, S.
Rozen, A. Schwartz, C. Sun, and C. Tilford for com-
ments on the manuscript. Supported by NIH.

28 April 1997; accepted 9 September 1997



Exploring the Metabolic and Genetic Control of 
Gene Expression on a Genomic Scale 

Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used
to carry out a comprehensive investigation of the temporal program of gene expression
accompanying the metabolic shift from fermentation to respiration. The expression
profiles observed for genes with known metabolic functions pointed to features of the
metabolic reprogramming that occur during the diauxic shift, and the expression patterns
of many previously uncharacterized genes provided clues to their possible functions. The
same DNA microarrays were also used to identify genes whose expression was affected
by deletion of the transcriptional co-repressor TUP1 or overexpression of the transcrip-
tional activator YAP1. These results demonstrate the feasibility and utility of this ap-
proach to genomewide exploration of gene expression patterns.



The complete sequences of nearly a dozen
microbial genomes are known, and in the 
next several years we expect to know the 
complete genome sequences of several 
metazoans, including the human genome. 
Defining the role of each gene in these 
genomes will be a formidable task, and un- 
derstanding how the genome functions as a 
whole in the complex natural history of a 
living organism presents an even greater 
challenge. 

Knowing when and where a gene is 
expressed often provides a strong clue as to 
its biological role. Conversely, the pattern 
of genes expressed in a cell can provide 
detailed information about its state. Al- 
though regulation of protein abundance in 
a cell is by no means accomplished solely 
by regulation of mRNA, virtually all dif- 
ferences in cell type or state are correlated 
with changes in the mRNA levels of many 
genes. This is fortuitous because the only 
specific reagent required to measure the 
abundance of the mRNA for a specific 
gene is a cDNA sequence. DNA microar- 
rays, consisting of thousands of individual 
gene sequences printed in a high-density 
array on a glass microscope slide (1, 2),
provide a practical and economical tool 
for studying gene expression on a very 
large scale (3-6). 

Department of Biochemistry, Stanford University School
of Medicine, Howard Hughes Medical Institute, Stanford,
CA 94305-5428, USA.

*To whom correspondence should be addressed. E-mail:
pbrown@cmgm.stanford.edu

Saccharomyces cerevisiae is an especially
favorable organism in which to conduct a
systematic investigation of gene expression. 
The genes are easy to recognize in the ge- 
nome sequence, cis regulatory elements are 
generally compact and close to the tran- 
scription units, much is already known 
about its genetic regulatory mechanisms, 
and a powerful set of tools is available for its 
analysis. 

A recurring cycle in the natural history 
of yeast involves a shift from anaerobic 
(fermentation) to aerobic (respiration) me- 
tabolism. Inoculation of yeast into a medi- 
um rich in sugar is followed by rapid growth 
fueled by fermentation, with the production 
of ethanol. When the fermentable sugar is 
exhausted, the yeast cells turn to ethanol as 
a carbon source for aerobic growth. This 
switch from anaerobic growth to aerobic 
respiration upon depletion of glucose, re- 
ferred to as the diauxic shift, is correlated 
with widespread changes in the expression 
of genes involved in fundamental cellular 
processes such as carbon metabolism, pro- 
tein synthesis, and carbohydrate storage 
(7). We used DNA microarrays to charac- 
terize the changes in gene expression that 
take place during this process for nearly the 
entire genome, and to investigate the ge- 
netic circuitry that regulates and executes 
this program. 

Yeast open reading frames (ORFs) were 
amplified by the polymerase chain reaction 
(PCR), with a commercially available set of 
primer pairs (8). DNA microarrays, con- 
taining approximately 6400 distinct DNA 
sequences, were printed onto glass slides by
using a simple robotic printing device (9).
Cells from an exponentially growing culture 
of yeast were inoculated into fresh medium 
and grown at 30°C for 21 hours. After an 
initial 9 hours of growth, samples were har- 
vested at seven successive 2-hour intervals, 
and mRNA was isolated (10). Fluorescently 
labeled cDNA was prepared by reverse tran-
scription in the presence of Cy3 (green)-
or Cy5 (red)-labeled deoxyuridine triphos-
phate (dUTP) (11) and then hybridized to
the microarrays (12). To maximize the re-
liability with which changes in expression
levels could be discerned, we labeled cDNA 
prepared from cells at each successive time 
point with Cy5, then mixed it with a Cy3- 
labeled "reference" cDNA sample prepared 
from cells harvested at the first interval 
after inoculation. In this experimental de- 
sign, the relative fluorescence intensity 
measured for the Cy3 and Cy5 fluors at 
each array element provides a reliable mea- 
sure of the relative abundance of the corre- 
sponding mRNA in the two cell popula- 
tions (Fig. 1). Data from the series of seven 
samples (Fig. 2), consisting of more than 
43,000 expression-ratio measurements, 
were organized into a database to facilitate 
efficient exploration and analysis of the 
results. This database is publicly available 
on the Internet (13). 
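The two-color design described above reduces each array element to a ratio of Cy5 to Cy3 signal. A minimal sketch of that calculation follows, assuming background-subtracted intensities are already available; the function names, the `floor` parameter, and the example intensities are hypothetical and are not taken from the report.

```python
import math

def log2_ratio(cy5, cy3, floor=1.0):
    """Return log2(Cy5/Cy3) for one array element.

    `floor` is a hypothetical lower bound that keeps near-zero
    intensities from producing extreme ratios.
    """
    return math.log2(max(cy5, floor) / max(cy3, floor))

def fold_change(cy5, cy3, floor=1.0):
    """Signed fold change: +4.0 is fourfold induced, -4.0 is fourfold repressed."""
    ratio = max(cy5, floor) / max(cy3, floor)
    return ratio if ratio >= 1.0 else -1.0 / ratio

# Hypothetical (Cy5, Cy3) intensities: time-point sample versus reference sample.
spots = {
    "geneA": (8000.0, 2000.0),
    "geneB": (500.0, 2000.0),
    "geneC": (1500.0, 1500.0),
}

for name, (cy5, cy3) in spots.items():
    print(name, round(log2_ratio(cy5, cy3), 2), round(fold_change(cy5, cy3), 2))
```

Working in log2 units makes induction and repression symmetric around zero, which is one common reason to prefer it over the raw ratio.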

During exponential growth in glucose- 
rich medium, the global pattern of gene 
expression was remarkably stable. Indeed, 
when gene expression patterns between the 
first two cell samples (harvested at a 2-hour
interval) were compared, mRNA levels dif- 
fered by a factor of 2 or more for only 19 
genes (0.3%), and the largest of these dif- 
ferences was only 2.7-fold (14). However, as 
glucose was progressively depleted from the 
growth media during the course of the ex- 
periment, a marked change was seen in the 
global pattern of gene expression. mRNA 
levels for approximately 710 genes were 
induced by a factor of at least 2, and the 
mRNA levels for approximately 1030 genes 
declined by a factor of at least 2. Messenger 
RNA levels for 183 genes increased by a 
factor of at least 4, and mRNA levels for 
203 genes diminished by a factor of at least 
4. About half of these differentially ex-
pressed genes have no currently recognized
function and are not yet named. Indeed, 
more than 400 of the differentially ex- 
pressed genes have no apparent homology 






SCIENCE • VOL. 278 • 24 OCTOBER 1997 • www.sciencemag.org 






REPORTS 



to any gene whose function is known (15).
The responses of these previously unchar- 
acterized genes to the diauxic shift therefore 
provide the first small clue to their possible
roles. 
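The gene counts quoted above come from thresholding expression ratios. The sketch below shows one way such a tally could be made, assuming each gene has a list of ratios across the time course and that a gene counts as induced (or repressed) if it crosses the threshold at any time point. That rule, the function name, and the data are illustrative assumptions, not the authors' stated procedure.

```python
def count_changed(profiles, threshold=2.0):
    """Count genes induced or repressed by at least `threshold`-fold.

    `profiles` maps a gene name to its expression ratios (sample/reference)
    over the time course. A gene that crosses both thresholds at different
    time points is counted in both tallies.
    """
    induced = sum(1 for ratios in profiles.values() if max(ratios) >= threshold)
    repressed = sum(1 for ratios in profiles.values() if min(ratios) <= 1.0 / threshold)
    return induced, repressed

# Hypothetical ratios for four genes at three time points.
profiles = {
    "g1": [1.0, 1.5, 4.5],
    "g2": [1.0, 0.6, 0.2],
    "g3": [1.1, 0.9, 1.2],
    "g4": [1.0, 2.0, 0.5],
}

print(count_changed(profiles, threshold=2.0))
print(count_changed(profiles, threshold=4.0))
```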

The global view of changes in expres- 
sion of genes with known functions pro- 
vides a vivid picture of the way in which 
the cell adapts to a changing environ- 
ment. Figure 3 shows a portion of the yeast 
metabolic pathways involved in carbon 
and energy metabolism. Mapping the 
changes we observed in the mRNAs en- 
coding each enzyme onto this framework 
allowed us to infer the redirection in the 
flow of metabolites through this system. 
We observed large inductions of the genes 
coding for the enzymes aldehyde dehydro-
genase (ALD2) and acetyl-coenzyme
A (CoA) synthase (ACS1), which func-
tion together to convert the products of
alcohol dehydrogenase into acetyl-CoA,
which in turn is used to fuel the tricarbox- 
ylic acid (TCA) cycle and the glyoxylate 
cycle. The concomitant shutdown of tran- 
scription of the genes encoding pyruvate 
decarboxylase and induction of pyruvate 
carboxylase rechannels pyruvate away 
from acetaldehyde, and instead to oxalac- 
etate, where it can serve to supply the 
TCA cycle and gluconeogenesis. Induc-
tion of the pivotal genes PCK1, encoding
phosphoenolpyruvate carboxykinase, and
FBP1, encoding fructose 1,6-biphos-
phatase, switches the directions of two key
irreversible steps in glycolysis, reversing 
the flow of metabolites along the revers- 
ible steps of the glycolytic pathway toward 
the essential biosynthetic precursor, glu- 
cose-6-phosphate. Induction of the genes 
coding for the trehalose synthase and gly- 
cogen synthase complexes promotes chan- 
neling of glucose-6-phosphate into these 
carbohydrate storage pathways. 

Just as the changes in expression of 
genes encoding pivotal enzymes can pro- 
vide insight into metabolic reprogram- 
ming, the behavior of large groups of func- 
tionally related genes can provide a broad 
view of the systematic way in which the 
yeast cell adapts to a changing environ- 
ment (Fig. 4). Several classes of genes, 
such as cytochrome c-related genes and 
those involved in the TCA/glyoxylate cy-
cle and carbohydrate storage, were coordi-
nately induced by glucose exhaustion. In
contrast, genes devoted to protein synthe- 
sis, including ribosomal proteins, tRNA 
synthetases, and translation, elongation, 
and initiation factors, exhibited a coordi- 
nated decrease in expression. More than 
95% of ribosomal genes showed at least 
twofold decreases in expression during the 
diauxic shift (Fig. 4) (13). A noteworthy
and illuminating exception was that the 



genes encoding mitochondrial ribosomal
proteins were generally induced rather than
repressed after glucose limitation, high-
lighting the requirement for mitochondrial
biogenesis (13). As more is learned about
the functions of every gene in the yeast 
genome, the ability to gain insight into a 
cell's response to a changing environment 
through its global gene expression patterns 
will become increasingly powerful. 

Several distinct temporal patterns of ex- 
pression could be recognized, and sets of 
genes could be grouped on the basis of the 
similarities in their expression patterns. The 
characterized members of each of these 
groups also shared important similarities in 
their functions. Moreover, in most cases, 
common regulatory mechanisms could be 
inferred for sets of genes with similar expres- 
sion profiles. For example, seven genes 
showed a late induction profile, with mRNA 
levels increasing by more than ninefold at 



the last timepoint but less than threefold at 
the preceding timepoint (Fig. 5B). All of 
these genes were known to be glucose-re- 
pressed, and five of the seven were previously 
noted to share a common upstream activat-
ing sequence (UAS), the carbon source re-
sponse element (CSRE) (16-20). A search
in the promoter regions of the remaining two
genes, ACR1 and IDP2, revealed that
ACR1, a gene essential for ACS1 activity,
also possessed a consensus CSRE motif, but
interestingly, IDP2 did not. A search of the
entire yeast genome sequence for the con- 
sensus CSRE motif revealed only four addi- 
tional candidate genes, none of which 
showed a similar induction. 
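The late induction profile described above is defined by two thresholds: more than ninefold at the last time point and less than threefold at the one before it. A minimal sketch of that selection rule is given here; the function name, parameter names, and example profiles are hypothetical.

```python
def late_induced(profiles, last_min=9.0, prev_max=3.0):
    """Return genes induced more than `last_min`-fold at the final time point
    but less than `prev_max`-fold at the preceding one."""
    return sorted(
        gene
        for gene, ratios in profiles.items()
        if len(ratios) >= 2 and ratios[-1] > last_min and ratios[-2] < prev_max
    )

# Hypothetical induction ratios over four time points.
profiles = {
    "g1": [1.0, 1.2, 2.5, 12.0],   # late induction
    "g2": [1.0, 2.0, 5.0, 11.0],   # induced too early
    "g3": [1.0, 1.0, 1.5, 4.0],    # not induced strongly enough
}

print(late_induced(profiles))
```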

Examples from additional groups of 
genes that shared expression profiles are 
illustrated in Fig. 5, C through F. The 
sequences upstream of the named genes in 
Fig. 5C all contain stress response ele-
ments (STRE), and with the exception






Fig. 1. Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm. The
microarray was printed as described (9). This image was obtained with the same fluorescent
scanning confocal microscope used to collect all the data we report (49). A fluorescently labeled
cDNA probe was prepared from mRNA isolated from cells harvested shortly after inoculation (culture
density of <5 x 10^6 cells/ml and media glucose level of 19 g/liter) by reverse transcription in the
presence of Cy3-dUTP. Similarly, a second probe was prepared from mRNA isolated from cells taken
from the same culture 9.5 hours later (culture density of ~2 x 10^ cells/ml, with a glucose level of
<0.2 g/liter) by reverse transcription in the presence of Cy5-dUTP. In this image, hybridization of the
Cy3-dUTP-labeled cDNA (that is, mRNA expression at the initial timepoint) is represented as a green
signal, and hybridization of Cy5-dUTP-labeled cDNA (that is, mRNA expression at 9.5 hours) is
represented as a red signal. Thus, genes induced or repressed after the diauxic shift appear in this
image as red and green spots, respectively. Genes expressed at roughly equal levels before and after
the diauxic shift appear in this image as yellow spots.






of HSP42, have previously been shown to
be controlled at least in part by these
elements (21-24). Inspection of the se-
quences upstream of HSP42 and the two
uncharacterized genes shown in Fig. 5C,
YKL026c, a hypothetical protein with
similarity to glutathione peroxidase, and
YGR043c, a putative transaldolase, re-
vealed that each of these genes also pos-
sesses repeated upstream copies of the stress-
responsive CCCCT motif. Of the 13 ad-
ditional genes in the yeast genome that
shared this expression profile [including
HSP30, ALD2, OM45, and 10 uncharac-
terized ORFs (25)], nine contained one or
more recognizable STRE sites in their up-
stream regions.

The heterotrimeric transcriptional acti-
vator complex HAP2,3,4 has been shown
to be responsible for induction of several
genes important for respiration (26-28).
This complex binds a degenerate consensus
sequence known as the CCAAT box (26).
Computer analysis, using the consensus se-
quence TNRYTGGB (29), has suggested
that a large number of genes involved in
respiration may be specific targets of
HAP2,3,4 (30). Indeed, a putative
HAP2,3,4 binding site could be found in
the sequences upstream of each of the seven
cytochrome c-related genes that showed
the greatest magnitude of induction (Fig.
5D). Of 12 additional cytochrome c-related
genes that were induced, HAP2,3,4 binding
sites were present in all but one. Signifi-
cantly, we found that transcription of
HAP4 itself was induced nearly ninefold
concomitant with the diauxic shift.
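A degenerate consensus such as TNRYTGGB can be matched by expanding each IUPAC nucleotide code into a character class. The sketch below illustrates the idea on both strands; it is not the software used in the study, and the helper names and the test sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus.upper())

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_sites(seq, consensus):
    """Return sorted (start, strand) pairs; start is the leftmost base of the
    site in forward-strand coordinates. Overlapping matches are reported."""
    seq = seq.upper()
    k = len(consensus)
    # A lookahead lets finditer report overlapping matches.
    pattern = re.compile("(?=(" + consensus_to_pattern(consensus) + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    hits += [(len(seq) - (m.start() + k), "-") for m in pattern.finditer(rc)]
    return sorted(hits)

# Hypothetical upstream fragment containing one forward-strand match.
fragment = "AAATGATTGGCAAA"
print(find_sites(fragment, "TNRYTGGB"))
```

Note that the forward-strand match TGATTGGC reads GCCAATCA on the opposite strand, which is where the CCAAT box named in the text appears.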

Control of ribosomal protein biogenesis
is mainly exerted at the transcriptional
level, through the presence of a common
upstream-activating element (UAS)
that is recognized by the Rap1 DNA-bind-
ing protein (31, 32). The expression pro-
files of seven ribosomal proteins are shown
in Fig. 5F. A search of the sequences
upstream of all seven genes revealed con-
sensus Rap1-binding motifs (33). It has
been suggested that declining Rap1 levels
in the cell during starvation may be re-
sponsible for the decline in ribosomal pro-
tein gene expression (34). Indeed, we ob-
served that the abundance of RAP1
mRNA diminished by 4.4-fold, at about
the time of glucose exhaustion.

Of the 149 genes that encode known or 
putative transcription factors, only two, 
HAP4 and SIP4, were induced by a factor of 
more than threefold at the diauxic shift. 
SIP4 encodes a DNA-binding transcrip- 
tional activator that has been shown to 
interact with Snf1, the "master regulator" of
glucose repression (35). The eightfold in-
duction of SIP4 upon depletion of glucose
strongly suggests a role in the induction of 



downstream genes at the diauxic shift. 

Although most of the transcriptional 
responses that we observed were not pre- 
viously known, the responses of many 
genes during the diauxic shift have been 
described. Comparison of the results we 
obtained by DNA microarray hybridiza- 
tion with previously reported results there- 
fore provided a strong test of the sensitiv- 
ity and accuracy of this approach. The 
expression patterns we observed for previ- 
ously characterized genes showed almost 
perfect concordance with previously pub- 
lished results (36). Moreover, the differ-
ential expression measurements obtained
by DNA microarray hybridization were re- 
producible in duplicate experiments. For 
example, the remarkable changes in gene 
expression between cells harvested imme- 
diately after inoculation and immediately 
after the diauxic shift (the first and sixth 
intervals in this time series) were mea- 
sured in duplicate, independent DNA mi- 
croarray hybridizations. The correlation 
coefficient for two complete sets of expres- 
sion ratio measurements was 0.87, and for 
more than 95% of the genes, the expres- 



sion ratios measured in these duplicate 
experiments differed by less than a factor 
of 2. However, in a few cases, there were 
discrepancies between our results and pre- 
vious results, pointing to technical limita- 
tions that will need to be addressed as 
DNA microarray technology advances 
(37, 38). Despite the noted exceptions, 
the high concordance between the results 
we obtained in these experiments and 
those of previous studies provides confi-
dence in the reliability and thoroughness
of the survey. 
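The two reproducibility figures quoted above (a correlation coefficient between duplicate hybridizations, and the fraction of genes whose ratios agree within a factor of 2) can be computed as sketched below. Whether the published coefficient was calculated on raw or log-transformed ratios is not stated here, so the function simply takes whatever values are passed in; all names and numbers are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def fraction_within(ratios_a, ratios_b, factor=2.0):
    """Fraction of genes whose duplicate ratios differ by less than `factor`."""
    limit = math.log2(factor)
    close = sum(
        1 for a, b in zip(ratios_a, ratios_b)
        if abs(math.log2(a) - math.log2(b)) < limit
    )
    return close / len(ratios_a)

# Hypothetical expression ratios for the same four genes in two hybridizations.
rep1 = [4.0, 0.5, 1.0, 8.0]
rep2 = [2.5, 0.4, 1.1, 2.0]

log1 = [math.log2(r) for r in rep1]
log2_ = [math.log2(r) for r in rep2]
print(round(pearson(log1, log2_), 3), fraction_within(rep1, rep2))
```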

The changes in gene expression during 
this diauxic shift are complex and involve 
integration of many kinds of information
about the nutritional and metabolic state
of the cell. The large number of genes
whose expression is altered and the diver- 
sity of temporal expression profiles ob- 
served in this experiment highlight the 
challenge of understanding the underlying 
regulatory mechanisms. One approach to 
defining the contributions of individual 
regulatory genes to a complex program of 
this kind is to use DNA microarrays to 
identify genes whose expression is affected 



Fig. 2. The section of the array indicated by the gray box in Fig. 1 is shown for each of
the experiments described here. Representative genes are labeled. In each of the arrays used
to analyze gene expression during the diauxic shift, red spots represent genes that were
induced relative to the initial timepoint, and green spots represent genes that were repressed
relative to the initial timepoint. In the arrays used to analyze the effects of the tup1Δ
mutation and YAP1 overexpression, red spots represent genes whose expression was increased,
and green spots represent genes whose expression was decreased by the genetic modification.
Note that distinct sets of genes are induced and repressed in the different experiments. The
complete images of each of these arrays can be viewed on the Internet (13). Cell density as
measured by optical density (OD) at 600 nm was used to measure the growth of the culture.



[Fig. 2 image. Legible panel titles: Growth OD 0.14; Growth OD 0.46; Growth OD 0.8; Growth OD 7.3; tup1Δ; YAP1. Legible spot labels: aryl-alcohol oxidoreductase; homology to maltase; bacterial CsgA homolog; Tip1 homolog; YAP1.]






by mutations in each putative regulatory
gene. As a test of this strategy, we analyzed
the genomewide changes in gene expression
that result from deletion of the TUP1 gene.
Transcriptional repression of many genes by
glucose requires the DNA-binding repressor
Mig1 and is mediated by recruiting the tran-
scriptional co-repressors Tup1 and Cyc8/
Ssn6 (39). Tup1 has also been implicated in
repression of oxygen-regulated, mating-type-
specific, and DNA-damage-inducible genes
(40).



[Fig. 3 diagram. Legible labels: Glucose; Trehalose; Glycogen; Pentose phosphate pathway, RNA, DNA, proteins; Glycolysis/gluconeogenesis.]




Fig. 3. Metabolic reprogramming inferred from global analysis of changes in gene expression. Only key
metabolic intermediates are identified. The yeast genes encoding the enzymes that catalyze each step
in this metabolic circuit are identified by name in the boxes. The genes encoding succinyl-CoA synthase
and glycogen-debranching enzyme have not been explicitly identified, but the ORFs YGR244 and
YPR184 show significant homology to known succinyl-CoA synthase and glycogen-debranching en-
zymes, respectively, and are therefore included in the corresponding steps in this figure. Red boxes with
white lettering identify genes whose expression increases in the diauxic shift. Green boxes with dark
green lettering identify genes whose expression diminishes in the diauxic shift. The magnitude of
induction or repression is indicated for these genes. For multimeric enzyme complexes, such as
succinate dehydrogenase, the indicated fold-induction represents an unweighted average of all the
genes listed in the box. Black and white boxes indicate no significant differential expression (less than
twofold). The direction of the arrows connecting reversible enzymatic steps indicates the direction of the
flow of metabolic intermediates, inferred from the gene expression pattern, after the diauxic shift. Arrows
representing steps catalyzed by genes whose expression was strongly induced are highlighted in red.
The broad gray arrows represent major increases in the flow of metabolites after the diauxic shift,
inferred from the indicated changes in gene expression.



Wild-type yeast cells and cells bearing
a deletion of the TUP1 gene (tup1Δ) were
grown in parallel cultures in rich medium
containing glucose as the carbon source.
Messenger RNA was isolated from expo-
nentially growing cells from the two pop-
ulations and used to prepare cDNA la-
beled with Cy3 (green) and Cy5 (red),
respectively (11). The labeled probes were
mixed and simultaneously hybridized to
the microarray. Red spots on the microar-
ray therefore represented genes whose
transcription was induced in the tup1Δ
strain, and thus presumably repressed by
Tup1 (41). A representative section of the
microarray (Fig. 2, bottom middle panel)
illustrates that the genes whose expression
was affected by the tup1Δ mutation were,
in general, distinct from those induced
upon glucose exhaustion [complete images
of all the arrays shown in Fig. 2 are avail-
able on the Internet (13)]. Nevertheless,
34 (10%) of the genes that were induced
by a factor of at least 2 after the diauxic
shift were similarly induced by deletion of
TUP1, suggesting that these genes may be
subject to TUP1-mediated repression by
glucose. For example, SUC2, the gene en-
coding invertase, and all five hexose trans-
porter genes that were induced during the
course of the diauxic shift were similarly
induced, in duplicate experiments, by the
deletion of TUP1.

The set of genes affected by Tup1 in this
experiment also included α-glucosidases,
the mating-type-specific genes MFA1 and
MFA2, and the DNA damage-inducible
RNR2 and RNR4, as well as genes involved
in flocculation and many genes of unknown
function. The hybridization signal corre-
sponding to expression of TUP1 itself was
also severely reduced because of the (in-
complete) deletion of the transcription unit
in the tup1Δ strain, providing a positive
control in the experiment (42).

Many of the transcriptional targets of
Tup1 fell into sets of genes with related
biochemical functions. For instance, al-
though only about 3% of all yeast genes
appeared to be TUP1-repressed by a factor
of more than 2 in duplicate experiments
under these conditions, 6 of the 13 genes
that have been implicated in flocculation
(15) showed a reproducible increase in
expression of at least twofold when TUP1
was deleted. Another group of related
genes that appeared to be subject to TUP1
repression encodes the serine-rich cell
wall mannoproteins, such as Tip1 and
Tir1/Srp1, which are induced by cold
shock and other stresses (43), and similar,
serine-poor proteins, the seripauperins
(44). Messenger RNA levels for 23 of the
26 genes in this group were reproducibly
elevated by at least 2.5-fold in the tup1Δ





strain, and 18 of these genes were induced
by more than sevenfold when TUP1 was
deleted. In contrast, none of 83 genes that
could be classified as putative regulators of
the cell division cycle were induced more
than twofold by deletion of TUP1. Thus,
despite the diversity of the regulatory sys-
tems that employ Tup1, most of the genes
that it regulates under these conditions
fall into a limited number of distinct func-
tional classes.

Because the microarray allows us to
monitor expression of nearly every gene in
yeast, we can, in principle, use this ap-
proach to identify all the transcriptional
targets of a regulatory protein like Tup1. It
is important to note, however, that in any
single experiment of this kind we can only
recognize those target genes that are nor-
mally repressed (or induced) under the
conditions of the experiment. For in-
stance, the experiment described here an-
alyzed a MATα strain in which MFA1
and MFA2, the genes encoding the a-
factor mating pheromone precursor, are
normally repressed. In the isogenic tup1Δ
strain, these genes were inappropriately
expressed, reflecting the role that Tup1
plays in their repression. Had we instead
carried out this experiment with a MATa
strain (in which expression of MFA1 and
MFA2 is not repressed), it would not have
been possible to conclude anything re-
garding the role of Tup1 in the repression
of these genes. Conversely, we cannot dis-
tinguish indirect effects of the chronic
absence of Tup1 in the mutant strain from
effects directly attributable to its partici-
pation in repressing the transcription of a
gene.

Another simple route to modulating the
activity of a regulatory factor is to overex-
press the gene that encodes it. YAP1 en-
codes a DNA-binding transcription factor
belonging to the b-zip class of DNA-bind-
ing proteins. Overexpression of YAP1 in
yeast confers increased resistance to hydro-
gen peroxide, o-phenanthroline, heavy
metals, and osmotic stress (45). We ana-
lyzed differential gene expression between a
wild-type strain bearing a control plasmid
and a strain with a plasmid expressing YAP1
under the control of the strong GAL1-10
promoter, both grown in galactose (that is,
a condition that induces YAP1 overexpres-
sion). Complementary DNA from the con-
trol and YAP1 overexpressing strains, la-
beled with Cy3 and Cy5, respectively, was
prepared from mRNA isolated from the two
strains and hybridized to the microarray.
Thus, red spots on the array represent genes
that were induced in the strain overexpress-
ing YAP1.

Of the 17 genes whose mRNA levels
increased by more than threefold when
YAP1 was overexpressed in this way, five
bear homology to aryl-alcohol oxidoreduc-
tases (Fig. 2 and Table 1). An additional
four of the genes in this set also belong to
the general class of dehydrogenases/oxi-
doreductases. Very little is known about
the role of aryl-alcohol oxidoreductases in
S. cerevisiae, but these enzymes have been
isolated from ligninolytic fungi, in which
they participate in coupled redox reac-
tions, oxidizing aromatic and aliphatic
unsaturated alcohols to aldehydes with the
production of hydrogen peroxide (46, 47).
The fact that a remarkable fraction of the
targets identified in this experiment be-
long to the same small, functional group of
oxidoreductases suggests that these genes
might play an important protective role
during oxidative stress. Transcription of a
small number of genes was reduced in the
strain overexpressing Yap1. Interestingly,
many of these genes encode sugar per-
meases or enzymes involved in inositol
metabolism.

We searched for Yap1-binding sites
(TTACTAA or TGACTAA) in the se-
quences upstream of the target genes we
identified (48). About two-thirds of the
genes that were induced by more than
threefold upon Yap1 overexpression had
one or more binding sites within 600 bases
upstream of the start codon (Table 1), sug-
gesting that they are directly regulated by
Yap1. The absence of canonical Yap1-bind-


[Fig. 4 plot. Axes: fold induction and fold repression versus time (hours). Curves: glycogen/trehalose; cytochrome c; TCA/glyoxylate cycle; ribosomal proteins; translation elongation/initiation; tRNA synthetase.]


Fig. 4. Coordinated regulation of functionally related genes. The curves represent the average
induction or repression ratios for all the genes in each indicated group. The total number of
genes in each group was as follows: ribosomal proteins, 112; translation elongation and initiation
factors, 25; tRNA synthetases (excluding mitochondrial synthetases), 17; glycogen and trehalose
synthesis and degradation, 15; cytochrome c oxidase and reductase proteins, 19; and TCA- and
glyoxylate-cycle enzymes, 24.

Table 1. Genes induced by YAP1 overexpression. This list includes all the genes for which mRNA levels
increased by more than twofold upon YAP1 overexpression in both of two duplicate experiments, and
for which the average increase in mRNA level in the two experiments was greater than threefold (50).
Positions of the canonical Yap1 binding sites upstream of the start codon, when present, and the
average fold-increase in mRNA levels measured in the two experiments are indicated.



ORF | Distance of Yap1 site from ATG | Gene | Description | Fold-increase
YNL331C | | | Putative aryl-alcohol reductase | 12.9
YKL071W | 162-222 (5 sites) | | Similarity to bacterial csgA protein | 10.4
YML007W | | YAP1 | Transcriptional activator involved in oxidative stress response | 9.8
YFL056C | 223, 242 | | Homology to aryl-alcohol dehydrogenases | 9.0
YLL060C | 98 | | Putative glutathione transferase | 7.4
YOL165C | 266 | | Putative aryl-alcohol dehydrogenase (NADP+) | 7.0
YCR107W | | | Putative aryl-alcohol reductase | 6.5
YML116W | 409 | ATR1 | Aminotriazole and 4-nitroquinoline resistance protein | 6.5
YBR008C | 142, 167, 364 | | Homology to benomyl/methotrexate resistance protein | 6.1
YCLX08C | | | Hypothetical protein | 6.1
YJR155W | | | Putative aryl-alcohol dehydrogenase | 6.0
YPL171C | 148, 212 | OYE3 | NADPH dehydrogenase (old yellow enzyme), isoform 3 | 5.8
YLR460C | 167, 317 | | Homology to hypothetical proteins YCR102C and YNL134C | 4.7
YKR076W | 178 | | Homology to hypothetical protein YMR251W | 4.5
YHR179W | 327 | OYE2 | NAD(P)H oxidoreductase (old yellow enzyme), isoform 1 | 4.1
YML131W | 507 | | Similarity to A. thaliana zeta-crystallin homolog | 3.7
YOL126C | | MDH2 | Malate dehydrogenase | 3.3
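The selection rule in the Table 1 caption can be sketched in a few lines of Python. This is only an illustration: the function name, the ORF labels, and the fold-change values are hypothetical, and the "average" is assumed to be the arithmetic mean of the two duplicate measurements.

```python
def yap1_induced(ratios, per_experiment_min=2.0, average_min=3.0):
    """Return genes that pass the two criteria stated in the Table 1 caption.

    ratios maps an ORF name to the pair of fold-increases measured in the
    two duplicate experiments. The result maps each passing ORF to its
    average fold-increase (arithmetic mean, an assumption).
    """
    selected = {}
    for orf, (first, second) in ratios.items():
        average = (first + second) / 2.0
        # More than twofold in BOTH experiments, and more than threefold on average.
        if min(first, second) > per_experiment_min and average > average_min:
            selected[orf] = average
    return selected


# Hypothetical duplicate measurements, for illustration only.
example = {
    "ORF_A": (6.0, 7.0),  # passes both criteria
    "ORF_B": (1.8, 5.0),  # average above 3, but one experiment below 2 (compare note 50)
    "ORF_C": (2.2, 2.6),  # both above 2, but average below 3
}
print(yap1_induced(example))
```

The second hypothetical gene mirrors the situation described in note 50: a high average does not compensate for a replicate that falls short of twofold.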






SCIENCE • VOL. 278 • 24 OCTOBER 1997 • www.sciencemag.org 






ing sites upstream of the others may reflect
an ability of Yap1 to bind sites that differ
from the canonical binding sites, perhaps in
cooperation with other factors, or, less likely,
may represent an indirect effect of Yap1
overexpression, mediated by one or more
intermediary factors. Yap1 sites were found
only four times in the corresponding region
of an arbitrary set of 30 genes that were not
differentially regulated by Yap1.

Use of a DNA microarray to characterize
the transcriptional consequences of
mutations affecting the activity of regulatory
molecules provides a simple and powerful
approach to dissection and characterization
of regulatory pathways and networks.
This strategy also has an important
practical application in drug screening.
Mutations in specific genes encoding candidate
drug targets can serve as surrogates
for the ideal chemical inhibitor or modulator
of their activity. DNA microarrays
can be used to define the resulting signature
pattern of alterations in gene expression,
and then subsequently used in an
assay to screen for compounds that reproduce
the desired signature pattern.

DNA microarrays provide a simple and
economical way to explore gene expression
patterns on a genomic scale. The
hurdles to extending this approach to any
other organism are minor. The equipment




[Fig. 5 plot residue: y axis, fold induction or repression; x axis, time (hours).]

Fig. 5. Distinct temporal patterns of induction or repression help to group genes that share regulatory
properties. (A) Temporal profile of the cell density, as measured by OD at 600 nm and glucose
concentration in the media. (B) Seven genes exhibited a strong induction (greater than ninefold) only at
the last timepoint (20.5 hours). With the exception of IDP2, each of these genes has a CSRE UAS. There
were no additional genes observed to match this profile. (C) Seven members of a class of genes marked
by early induction with a peak in mRNA levels at 18.5 hours. Each of these genes contains STRE motif
repeats in its upstream promoter region. (D) Cytochrome c oxidase and ubiquinol cytochrome c
reductase genes. Marked by an induction coincident with the diauxic shift, each of these genes contains
a consensus binding motif for the HAP2,3,4 protein complex. At least 17 genes shared a similar
expression profile. (E) SAM1, GPP1, and several genes of unknown function are repressed before the
diauxic shift, and continue to be repressed upon entry into stationary phase. (F) Ribosomal protein
genes comprise a large class of genes that are repressed upon depletion of glucose. Each of the genes
profiled here contains one or more RAP1-binding motifs upstream of its promoter. RAP1 is a transcriptional
regulator of most ribosomal proteins.



required for fabricating and using DNA
microarrays (9) consists of components
that were chosen for their modest cost and
simplicity. It was feasible for a small group
to accomplish the amplification of more
than 6000 genes in about 4 months and,
once the amplified gene sequences were in
hand, only 2 days were required to print a
set of 110 microarrays of 6400 elements
each. Probe preparation, hybridization,
and fluorescent imaging are also simple
procedures. Even conceptually simple experiments,
as we described here, can yield
vast amounts of information. The value of
the information from each experiment of
this kind will progressively increase as
more is learned about the functions of
each gene and as additional experiments
define the global changes in gene expression
in diverse other natural processes and
genetic perturbations. Perhaps the greatest
challenge now is to develop efficient
methods for organizing, distributing, interpreting,
and extracting insights from the
large volumes of data these experiments
will provide.

REFERENCES AND NOTES 



1. M. Schena, D. Shalon, R. W. Davis, P. O. Brown, Science 270, 467 (1995).

2. D. Shalon, S. J. Smith, P. O. Brown, Genome Res. 6, 639 (1996).

3. D. Lashkari, Proc. Natl. Acad. Sci. U.S.A., in press.

4. J. DeRisi et al., Nature Genet. 14, 457 (1996).

5. D. J. Lockhart et al., Nature Biotechnol. 14, 1675 (1996).

6. M. Chee et al., Science 274, 610 (1996).

7. M. Johnston and M. Carlson, in The Molecular Biology of the Yeast Saccharomyces: Gene Expression, E. W. Jones, J. R. Pringle, J. R. Broach, Eds. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 1992), p. 193.

8. Primers for each known or predicted protein coding sequence were supplied by Research Genetics. PCR was performed with the protocol supplied by Research Genetics, using genomic DNA from yeast strain S288C as a template. Each PCR product was verified by agarose gel electrophoresis and was deemed correct if the lane contained a single band of appropriate mobility. Failures were marked as such in the database. The overall success rate for a single-pass amplification of 6116 ORFs was ~94.5%.

9. Glass slides (Gold Seal) were cleaned for 2 hours in a solution of 2 N NaOH and 70% ethanol. After rinsing in distilled water, the slides were then treated with a 1:5 dilution of poly-L-lysine adhesive solution (Sigma) for 1 hour, and then dried for 5 min at 40°C in a vacuum oven. DNA samples from 100-µl PCR reactions were purified by ethanol purification in 96-well microtiter plates. The resulting precipitates were resuspended in 3× standard saline citrate (SSC) and transferred to new plates for arraying. A custom-built arraying robot was used to print on a batch of 110 slides. Details of the design of the microarrayer are available at cmgm.stanford.edu/pbrown. After printing, the microarrays were rehydrated for 30 s in a humid chamber and then snap-dried for 2 s on a hot plate (100°C). The DNA was then ultraviolet (UV)-crosslinked to the surface by subjecting the slides to 60 mJ of energy (Stratagene Stratalinker). The rest of the poly-L-lysine surface was blocked by a 15-min incubation in a solution of 70 mM succinic anhydride dissolved in a solution consisting of 315 ml of 1-methyl-2-pyrrolidinone (Aldrich) and 35 ml of 1 M boric acid (pH 8.0). Directly after the blocking reaction, the bound DNA was denatured by a 2-min incubation in distilled water at ~95°C. The slides were then transferred into a bath of 100% ethanol at room temperature, rinsed, and then spun dry in a clinical centrifuge. Slides were stored in a closed box at room temperature until used.

10. YPD medium (8 liters), in a 10-liter fermentation vessel, was inoculated with 2 ml of a fresh overnight culture of yeast strain DBY7286 (MATa, ura3, GAL2). The fermentor was maintained at 30°C with constant agitation and aeration. The glucose content of the media was measured with a UV test kit (Boehringer Mannheim, catalog number 716251). Cell density was measured by OD at 600-nm wavelength. Aliquots of culture were rapidly withdrawn from the fermentation vessel by peristaltic pump, spun down at room temperature, and then flash frozen with liquid nitrogen. Frozen cells were stored at -80°C.

11. Cy3-dUTP or Cy5-dUTP (Amersham) was incorporated during reverse transcription of 1.25 µg of polyadenylated [poly(A)+] RNA, primed by a dT(16) oligomer. This mixture was heated to 70°C for 10 min, and then transferred to ice. A premixed solution, consisting of 200 U Superscript II (Gibco), buffer, deoxyribonucleoside triphosphates, and fluorescent nucleotides, was added to the RNA. Nucleotides were used at these final concentrations: 500 µM for dATP, dCTP, and dGTP and 200 µM for dTTP. Cy3-dUTP and Cy5-dUTP were used at a final concentration of 100 µM. The reaction was then incubated at 42°C for 2 hours. Unincorporated fluorescent nucleotides were removed by first diluting the reaction mixture with 470 µl of 10 mM tris-HCl (pH 8.0)/1 mM EDTA and then subsequently concentrating the mix to ~5 µl, using Centricon-30 microconcentrators (Amicon).

12. Purified, labeled cDNA was resuspended in 11 µl of 3.5× SSC containing 10 µg poly(dA) and 0.3 µl of 10% SDS. Before hybridization, the solution was boiled for 2 min and then allowed to cool to room temperature. The solution was applied to the microarray under a cover slip, and the slide was placed in a custom hybridization chamber which was subsequently incubated for ~8 to 12 hours in a water bath at 62°C. Before scanning, slides were washed in 2× SSC, 0.2% SDS for 5 min, and then 0.05× SSC for 1 min. Slides were dried before scanning by centrifugation at 500 rpm in a Beckman CS-6R centrifuge.

13. The complete data set is available on the Internet at cmgm.stanford.edu/pbrown/explore/index.html

14. For 95% of all the genes analyzed, the mRNA levels measured in cells harvested at the first and second interval after inoculation differed by a factor of less than 1.5. The correlation coefficient for the comparison between mRNA levels measured for each gene in these two different mRNA samples was 0.98. When duplicate mRNA preparations from the same cell sample were compared in the same way, the correlation coefficient between the expression levels measured for the two samples by comparative hybridization was 0.99.

15. The numbers and identities of known and putative genes, and their homologies to other genes, were gathered from the following public databases: Saccharomyces Genome Database (genome-www.stanford.edu), Yeast Protein Database (quest7.proteome.com), and Munich Information Centre for Protein Sequences (speedy.mips.biochem.mpg.de/mips/yeast/index.htmlx).

16. A. Scholer and H. J. Schuller, Mol. Cell. Biol. 14, 3613 (1994).

17. S. Kratzer and H. J. Schuller, Gene 161, 75 (1995).

18. R. J. Haselbeck and H. L. McAlister, J. Biol. Chem. 268, 12116 (1993).

19. M. Fernandez, E. Fernandez, R. Rodicio, Mol. Gen. Genet. 242, 727 (1994).

20. A. Hartig et al., Nucleic Acids Res. 20, 5677 (1992).

21. P. Martinez et al., EMBO J. 15, 2227 (1996).

22. J. C. Varela, U. M. Praekelt, P. A. Meacock, R. J. Planta, W. H. Mager, Mol. Cell. Biol. 15, 6232 (1995).

23. H. Ruis and C. Schuller, Bioessays 17, 959 (1995).

24. J. L. Parrou, M. A. Teste, J. Francois, Microbiology 143, 1891 (1997).



25. This expression profile was defined as having an induction of greater than 10-fold at 18.5 hours and less than 11-fold at 20.5 hours.

26. S. L. Forsburg and L. Guarente, Genes Dev. 3, 1166 (1989).

27. J. T. Olesen and L. Guarente, ibid. 4, 1714 (1990).

28. M. Rosenkrantz, C. S. Kell, E. A. Pennell, L. J. Devenish, Mol. Microbiol. 13, 119 (1994).

29. Single-letter abbreviations for the amino acid residues are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr. The nucleotide codes are as follows: B-C, G, or T; N-G, A, T, or C; R-A or G; and Y-C or T.

30. C. Fondrat and A. Kalogeropoulos, Comput. Appl. Biosci. 12, 353 (1996).

31. D. Shore, Trends Genet. 10, 408 (1994).

32. R. J. Planta and H. A. Raue, ibid. 4, 64 (1988).

33. The degenerate consensus sequence VYCYRNNCMNH was used to search for potential RAP1-binding sites. The exact consensus, as defined by pp), is WACAYCCRTACATYW, with up to three differences allowed.

34. S. F. Neuman, S. Bhattacharya, J. R. Broach, Mol. Cell. Biol. 15, 3187 (1995).

35. P. Lesage, X. Yang, M. Carlson, ibid. 16, 1921 (1996).

36. For example, we observed large inductions of the genes coding for PCK1, FBP1 [Z. Yin et al., Mol. Microbiol. 20, 751 (1996)], the central glyoxylate cycle gene ICL1 [A. Scholer and H. J. Schuller, Curr. Genet. 23, 375 (1993)], and the "aerobic" isoform of acetyl-CoA synthase, ACS1 [M. A. van den Berg et al., J. Biol. Chem. 271, 28953 (1996)], with concomitant down-regulation of the glycolytic-specific genes PYK1 and [P. A. Moore et al., Mol. Cell. Biol. 11, 5330 (1991)]. Other genes not directly involved in carbon metabolism but known to be induced upon nutrient limitation include genes encoding cytosolic catalase T CTT1 [P. H. Bissinger et al., ibid. 9, 1309 (1989)] and several genes encoding small heat-shock proteins, such as HSP12, HSP26, and HSP42 [I. Farkas et al., J. Biol. Chem. 266, 15602 (1991); U. M. Praekelt and P. A. Meacock, Mol. Gen. Genet. 223, 97 (1990); D. Wotton et al., J. Biol. Chem. 271, 2717 (1996)].

37. The levels of induction we measured for genes that were expressed at very low levels in the uninduced state (notably, FBP1 and PCK1) were generally lower than those previously reported. This discrepancy was likely due to the conservative background subtraction method we used, which generally resulted in overestimation of very low expression levels (46).

38. Cross-hybridization of highly related sequences can also occasionally obscure changes in gene expression, an important concern where members of gene families are functionally specialized and differentially regulated. The major alcohol dehydrogenase genes, ADH1 and ADH2, share 88% nucleotide identity. Reciprocal regulation of these genes is an important feature of the diauxic shift, but was not observed in this experiment, presumably because of cross-hybridization of the fluorescent cDNAs representing these two genes. Nevertheless, we were able to detect differential expression of closely related isoforms of other enzymes, such as HXK1/HXK2 (77% identical) [P. Herrero et al., Yeast 11, 137 (1996)], MLS1/DAL7 (73% identical) (20), and PGM1/PGM2 (72% identical) [D. Oh, J. E. Hopper, Mol. Cell. Biol. 10, 1415 (1990)], in accord with previous studies. Use in the microarray of deliberately selected DNA sequences corresponding to the most divergent segments of homologous genes, in lieu of the complete gene sequences, should relieve this problem in many cases.

39. F. E. Williams, U. Varanasi, R. J. Trumbly, Mol. Cell. Biol. 11, 3307 (1991).

40. D. Tzamarias and K. Struhl, Nature 369, 758 (1994).

41. Differences in mRNA levels between the tup1Δ and wild-type strain were measured in two independent experiments. The correlation coefficient between the complete sets of expression ratios measured in these duplicate experiments was 0.83. The concordance between the sets of genes that appeared to be induced was very high between the two experiments. When only the 355 genes that showed at least a twofold increase in mRNA in the tup1Δ strain in either of the duplicate experiments were compared, the correlation coefficient was 0.82.

42. The tup1Δ mutation consists of an insertion of the LEU2 coding sequence, including a stop codon, between the ATG of TUP1 and an Eco RI site 124 base pairs before the stop codon of the TUP1 gene.

43. L. R. Kowalski, K. Kondo, M. Inouye, Mol. Microbiol. 15, 341 (1995).

44. M. Viswanathan, G. Muthukumar, Y. S. Cong, J. Lenard, Gene 148, 149 (1994).

45. D. Hirata, K. Yano, T. Miyakawa, Mol. Gen. Genet. 242, 250 (1994).

46. A. Gutierrez, L. Caramelo, A. Prieto, M. J. Martinez, A. T. Martinez, Appl. Environ. Microbiol. 60, 1783 (1994).

47. A. Muheim et al., Eur. J. Biochem. 195, 369 (1991).

48. J. A. Wemmie, M. S. Szczypka, D. J. Thiele, W. S. Moye-Rowley, J. Biol. Chem. 269, 32592 (1994).

49. Microarrays were scanned using a custom-built scanning laser microscope built by S. Smith with software written by N. Ziv. Details concerning scanner design and construction are available at cmgm.stanford.edu/pbrown. Images were scanned at a resolution of 20 µm per pixel. A separate scan, using the appropriate excitation line, was done for each of the two fluorophores used. During the scanning process, the ratio between the signals in the two channels was calculated for several array elements containing total genomic DNA. To normalize the two channels with respect to overall intensity, we then adjusted photomultiplier and laser power settings such that the signal ratio at these elements was as close to 1.0 as possible. The combined images were analyzed with custom-written software. A bounding box, fitted to the size of the DNA spots in each quadrant, was placed over each array element. The average fluorescent intensity was calculated by summing the intensities of each pixel present in a bounding box, and then dividing by the total number of pixels. Local area background was calculated for each array element by determining the average fluorescent intensity for the lower 20% of pixel intensities. Although this method tends to underestimate the background, causing an underestimation of extreme ratios, it produces a very consistent and noise-tolerant approximation. Although the analog-to-digital board used for data collection possesses a wide dynamic range (12 bits), several signals were saturated (greater than the maximum signal intensity allowed) at the chosen settings. Therefore, extreme ratios at bright elements are generally underestimated. A signal was deemed significant if the average intensity after background subtraction was at least 2.5-fold higher than the standard deviation in the background measurements for all elements on the array.

50. In addition to the 17 genes shown in Table 1, three additional genes were induced by an average of more than threefold in the duplicate experiments, but in one of the two experiments, the induction was less than twofold (range 1.6- to 1.9-fold).

51. We thank H. Bennett, P. Spellman, J. Flavetto, M. Eisen, R. Pillai, B. Dunn, T. Ferea, and other members of the Brown lab for their assistance and helpful advice. We also thank S. Friend, D. Botstein, S. Smith, J. Hudson, and D. Dolginow for advice, support, and encouragement; K. Struhl and S. Chatterjee for the Tup1 deletion strain; L. Fernandes for helpful advice on Yap1; and S. Wapholz and the reviewers for many helpful comments on the manuscript. Supported by a grant from the National Human Genome Research Institute (NHGRI) (HG00450), and by the Howard Hughes Medical Institute (HHMI). J.D.R. was supported by the HHMI and the NHGRI. V.R. was supported in part by an Institutional Training Grant in Genome Science (T32 HG00044) from the NHGRI. P.O.B. is an associate investigator of the HHMI.
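
The consensus searches described in notes 29 and 33 can be sketched as follows. This is a minimal illustration, not the authors' software: the function names and the test sequences are hypothetical, and the code table uses the standard IUPAC nucleotide codes (note 29 lists only B, N, R, and Y, so the meanings of V, M, H, and W are taken from the IUPAC convention).

```python
import re

# Standard IUPAC nucleotide codes (assumed; note 29 lists only B, N, R, Y).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Translate a degenerate consensus into a regular-expression string."""
    return "".join("[" + IUPAC[code] + "]" for code in consensus)


def find_degenerate(consensus, sequence):
    """0-based start positions of all (possibly overlapping) exact matches."""
    # A lookahead lets overlapping matches be reported.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [match.start() for match in pattern.finditer(sequence)]


def find_with_mismatches(consensus, sequence, max_mismatches=3):
    """0-based start positions where at most max_mismatches positions disagree."""
    hits = []
    width = len(consensus)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            base not in IUPAC[code] for code, base in zip(consensus, window)
        )
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


# Hypothetical sequences, for illustration only.
print(find_degenerate("VYCYRNNCMNH", "GGACCCATACATTGG"))
print(find_with_mismatches("WACAYCCRTACATYW", "AGGTCCCATACATTA"))
```

Only the top strand is scanned here; a real promoter search would presumably also scan the reverse complement.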

5 September 1997; accepted 22 September 1997 






Reference 11 of 20

with Response dated 05/04/04 
In USSN: 09/857,826 

Proc. Natl. Acad. Sci. USA

Vol. 95, pp. 6073-6078, May 1998 

Biochemistry 



Assessing sequence comparison methods with reliable structurally 
identified distant evolutionary relationships 

Steven E. Brenner*†‡, Cyrus Chothia*, and Tim J. P. Hubbard§

*MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom; and §Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambs CB10 1SA, United Kingdom

Communicated by David R. Davies, National Institute of Diabetes, Bethesda, MD, March 16, 1998 (received for review November 12, 1997)



ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For
more distantly related proteins, they do much less well; only
one-half of the relationships between proteins with 20-30%
identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by
any pairwise comparison method; however, those which are
identified may be used with confidence.



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 



The publication costs of this article were defrayed in part by page charge
payment. This article must therefore be hereby marked "advertisement" in
accordance with 18 U.S.C. §1734 solely to indicate this fact.

© 1998 by The National Academy of Sciences 0027-8424/98/956073-6$2.00/0
PNAS is available online at http://www.pnas.org.



Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to
overcome both of the fundamental difficulties that have hindered
assessment of sequence comparison methodologies.
First, we use the set of distant evolutionary relationships in the
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
The SCOP database provides a uniquely reliable set of homologs,
which are known independently of sequence comparison.
Second, we use an assessment method that jointly measures
both sensitivity and specificity. This method allows
straightforward comparison of different sequence searching
procedures. Further, it can be used to aid interpretation of real
database searches and thus provide optimal and reliable
results.

Previous Assessments of Sequence Comparison. Several
previous studies have examined the relative performance of
different sequence comparison methods. The most encompassing
analyses have been by Pearson (6, 7), who compared
the three most commonly used programs. Of these, the
Smith-Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs.

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR 



Abbreviation: EPQ, errors per query. 

†Present address: Department of Structural Biology, Stanford University,
Fairchild Building D-109, Stanford, CA 94305-5126.

‡To whom reprint requests should be addressed. e-mail:
brenner@hyper.stanford.edu.






superfamilies. Pearson found that modern matrices and "ln-scaling"
of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predetermined
score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the
BLOSUM62 matrix (14) performed markedly better than the
extrapolated PAM-series matrices (15), which previously had
been popular.

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence com- 
parison, the correct results were effectively unknown. This is 
because the superfamilies in PIR and PROSITE are principally 
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but PIR places them in different superfamilies. 
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6 
other PIR superfamilies (16). 

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis
was the HSSP equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23), and empirical approaches
have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical tractability
of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homologs. Thus, although many researchers have suggested
that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior.
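
To make the two statistical measures concrete, the sketch below relates a raw alignment score to an E-value and an E-value to a P-value. It assumes the commonly cited Karlin-Altschul form E = K·m·n·exp(-λS) and the Poisson relation P = 1 - exp(-E); neither formula is spelled out in the text above, and the function names and the λ and K values are illustrative assumptions rather than parameters taken from any of the programs tested.

```python
import math


def karlin_altschul_evalue(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least raw_score
    (assumed form: E = K * m * n * exp(-lambda * S))."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def pvalue_from_evalue(e_value):
    """Probability of at least one chance match, assuming the number of
    chance matches is Poisson distributed with mean e_value."""
    # expm1 keeps precision when e_value is very small.
    return -math.expm1(-e_value)


# Illustrative parameters only (assumed, not taken from the paper).
LAMBDA, K = 0.32, 0.13
for score in (40, 60, 80):
    e = karlin_altschul_evalue(score, query_len=200, db_len=250_000, lam=LAMBDA, k=K)
    print(score, f"E={e:.3g}", f"P={pvalue_from_evalue(e):.3g}")
```

For small values the two measures are nearly equal, so the distinction matters mainly for marginal matches, where E can exceed 1 while P cannot.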

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it 



is very probable that they have an evolutionary relationship 
though their sequence similarity may be low. 

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the SCOP database (4, 5) have allowed us to overcome previous 
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The SCOP database
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of 
proteins in the Protein Data Bank (PDB) (30) and created two 
databases. One (pdb90D-b) has domains, which were all <9Q% 
identical to any other, whereas (pdB40D-b) had those <40% 
identical. The databases were created by first sorting all 
protein domains in SCOP by their quality and making a list. The 
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of 
distant relationships, or ^^0.5% of the total 1,749,006 ordered 
pairs. In PDB90D-B, the 2,079 domains have 53,988 relation- 
ships, representing 1.2% of all pairs. Low complexity regions 
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the SEG program 
(27) using recommended parameters: 12 1.8 2.0. The databases 
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
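
The selection procedure described above can be sketched as a greedy filter. This is a minimal illustration, not the authors' code: the `quality` ranking and the `identity` function are hypothetical stand-ins for the SCOP quality ordering and a pairwise percentage-identity calculation, neither of which is specified in the text.

```python
from typing import Callable, Hashable, List, Sequence


def select_nonredundant(
    domains: Sequence[Hashable],
    quality: Callable[[Hashable], float],
    identity: Callable[[Hashable, Hashable], float],
    threshold: float,
) -> List[Hashable]:
    """Greedy selection: keep the best remaining domain, then discard
    every other domain at or above `threshold` percent identity to it."""
    # Sort once so the highest-quality domain is always at the front.
    remaining = sorted(domains, key=quality, reverse=True)
    selected = []
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        # Keep only domains strictly below the identity threshold.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct domains."""
    return n * (n - 1)
```

As a consistency check on the figures quoted above, `ordered_pairs(1323)` gives 1,749,006, the total stated for PDB40D-B.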

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the
heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted other-
wise, the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and WU-
BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix (BLO-
SUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 



Biochemistry: Brenner et al.

Proc. Natl. Acad. Sci. USA 95 (1998) 6075



[Figure 1 plots. Panel A: Smith-Waterman Scoring Schemes (PDB40D-B). Panel B: Smith-Waterman Scoring Schemes (PDB90D-B). x axis: Coverage; y axis: errors per query, log scale.]

Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors correspond to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus $H$. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how 




Hemoglobin β-chain (1hdsb)    Cellulase E2 (1tml_)

1hdsb GKVDVDVWAQALGR- -LLVVypWTQRFFOHFGNLSSAGAVMMNPKVKAHGKRVLDAFTQGLKH
1tml_ GOVDALMSAAQAAGKIPILWYNAPGR DCGNHSSGGA PSHSAY-RSWIDEFAAGLKN

Fig. 2. Unrelated proteins with high percentage identity. Hemo-
globin β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).



protocols compare at different levels of accuracy. These 
graphs share effectively all of the beneficial features of Re-
ceiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in 
sequence comparison and the huge background of nonho- 
mologs. 
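
The coverage and EPQ definitions above translate directly into a short routine. The following is a sketch under stated assumptions rather than the authors' implementation: the function name and input layout are hypothetical, higher scores are taken to be better (E-values would need to be negated first), and tied scores are not given special treatment.

```python
from typing import Iterable, List, Tuple


def coverage_vs_error(
    hits: Iterable[Tuple[float, bool]],
    total_homolog_pairs: int,
    num_queries: int,
) -> List[Tuple[float, float, float]]:
    """Return (score threshold, coverage, EPQ) after each hit.

    Each hit is (score, is_homolog). Hits are scanned from best to
    worst, so every entry describes the threshold set at that score.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    true_found = 0
    errors = 0
    curve = []
    for score, is_homolog in ranked:
        if is_homolog:
            true_found += 1   # a structurally confirmed homolog
        else:
            errors += 1       # a nonhomologous pair above the threshold
        curve.append(
            (score, true_found / total_homolog_pairs, errors / num_queries)
        )
    return curve
```

With the PDB40D-B figures given in the Fig. 1 legend, 13 errors over 1,323 queries gives an EPQ of about 0.01.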

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 

[Figure 3 plot: Percent Identity of Unrelated Proteins (PDB90D-B). x axis: Alignment length (0 to 200); y axis: percentage identity within the alignment.]



Fig. 3. Length and percentage identity of alignments of unrelated 
proteins in PDB90D-B: Each pair of nonhomologous proteins found with 
SSEARCH is plotted as a point whose position indicates the length and 
the percentage identity within the alignment. Because alignment 
length and percentage identity are quantized, many pairs of proteins 
may have exactly the same alignment length and percentage identity. 
The line shows the HSSP threshold (though it is intended to be applied 
with a different matrix and parameters). 






[Figure 4 plot: Reliability of Statistical Scores (PDB90D-B). x axis: Errors Per Query (log scale); y axis: reported statistical score. Legend entries include SSEARCH E-Value, FASTA ktup=1 E-Value, and FASTA ktup=2 E-Value.]



Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph 
could be used to roughly calibrate the reliability of a given statistical 
score. 

ignored in previous tests but is essential for the straightforward 
or automatic interpretation of sequence comparison results. 
Further, it provides a clear indication of the confidence that 
should be ascribed to each match. Indeed, the EPQ measure 
should approximate the expectation value reported by data- 
base searching programs, if the programs' estimates are accu- 
rate. 

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align- 
ment and subtracting gap penalties. In BLAST, a measure 

[Figure 5A plot: Sequence Comparison Algorithms (PDB40D-B). y axis: errors per query, log scale.]




related to this score is scaled into bits. Third is a statistical 
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 
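
To make the raw score concrete, the sketch below sums substitution scores over aligned positions and subtracts gap penalties. It is illustrative only: the function name and the toy matrix in the usage note are hypothetical, and reading the -12/-1 penalties as 12 for the first gapped position and 1 for each further position is an assumption, not something the text specifies.

```python
from typing import Dict, Tuple


def raw_score(
    aligned_a: str,
    aligned_b: str,
    matrix: Dict[Tuple[str, str], int],
    gap_first: int = 12,
    gap_next: int = 1,
) -> int:
    """Score a gapped alignment given as two equal-length strings,
    with '-' marking gap positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_in = None  # which sequence the currently open gap is in, if any
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap is cheaper than opening a new one.
            score -= gap_next if gap_in == side else gap_first
            gap_in = side
        else:
            # The matrix is symmetric, so look up either orientation.
            score += matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]
            gap_in = None
    return score
```

For example, with a toy matrix `{("A", "A"): 4, ("A", "G"): 0, ("G", "G"): 6}`, aligning `AGGA` against `A--A` scores 4 - 12 - 1 + 4 = -5.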

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30% 
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the hssp equation improves the value of percentage 
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 
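
The HSSP-adjusted score defined in the Fig. 1 legend can be written out as below. This is a sketch: the function names are hypothetical, and the handling of the exact boundary lengths 10 and 80 (which the legend leaves open) is an assumption. Lengths below 10 are given an infinite threshold to represent H > 100, which no percentage identity can exceed.

```python
import math


def hssp_threshold(length: int) -> float:
    """HSSP curve H as a function of alignment length (percent identity)."""
    if length < 10:
        return math.inf           # H > 100: no alignment this short passes
    if length > 80:
        return 24.7               # plateau for long alignments
    return 290.15 * length ** -0.562


def hssp_adjusted(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus the HSSP threshold."""
    return percent_identity - hssp_threshold(length)
```

Under this sketch, the unrelated pair in Fig. 2 (39% identity over 64 residues) sits roughly 11 points above the curve, which illustrates why a length-corrected identity threshold can still admit errors.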

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most 

[Figure 5 plots, continued. Panel B: Sequence Comparison Algorithms (PDB90D-B). x axis: Coverage; y axis: errors per query, log scale.]




Fig. 5. Coverage vs. error plots of different sequence comparison methods: Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.






likely, its power can be attributed to its incorporation of more 
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per 
query (Fig. 4). The expectation value score gives a good, 
slightly conservative estimate of the chances of the two se- 
quences being found at random in a given query. Thus, an 
E-value of 0.01 indicates that roughly one pair of nonhomologs 
of this similarity should be found in every 100 different queries. 
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 

The P-values from BLAST also should be directly interpret- 
able but were found to overstate significance by more than two 
orders of magnitude for 1% EPQ for this database. Nonethe- 
less, these results strongly suggest that the analytic theory is 
fundamentally appropriate. WU-BLAST2 scores were more re- 
liable than those from BLAST, but also exaggerate expected 
confidence by more than an order of magnitude at 1% EPQ. 
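
The relationship between the two kinds of statistical score can be illustrated numerically. The sketch below assumes the usual Poisson conversion, P = 1 - exp(-E), which is not stated in the text; the function names are hypothetical. It shows why P-values and E-values nearly coincide when small and diverge as they grow, and how an E-value translates into an expected number of false hits over many queries.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given an expected count E."""
    return -math.expm1(-e_value)      # 1 - exp(-E), accurate for small E


def e_from_p(p_value: float) -> float:
    """Inverse conversion: expected count E for a given P-value (P < 1)."""
    return -math.log1p(-p_value)      # -ln(1 - P)


def expected_false_hits(e_value: float, num_queries: int) -> float:
    """Expected number of nonhomologous pairs at this E-value over
    `num_queries` independent queries, if the E-values are accurate."""
    return e_value * num_queries
```

For instance, an E-value of 0.01 over 100 queries gives an expectation of one false hit, in line with the interpretation given above.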

Overall Detection of Homologs and Comparison of Algo- 
rithms. The results in Fig. 5A and Table 1 show that pairwise 
sequence comparison is capable of identifying only a small 
fraction of the homologous pairs of sequences in PDB40D-B. 
Even SSEARCH with E-values, the best protocol tested, could 
find only 18% of all relationships at a 1% EPQ. BLAST, which 
identifies 15%, was the worst performer, whereas FASTA 
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and 
WU-BLAST2 are intermediate in their ability to detect ho- 
mologs. Comparison of different algorithms indicates that 
those capable of identifying more homologs are generally 
slower. SSEARCH is 25 times slower than BLAST and 6.5 times 
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores. 

In PDB90D-B, where there are many close relationships, the 
best method can identify only 38% of structurally known 
homologs (Fig. 5B). The method which finds that many 
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with 
variation in database composition and scoring reliability. 

Fig. 6 helps to explain why most distant homologs cannot be 
found by sequence comparison: a great many such relation- 
ships have no more sequence identity than would be expected 
by chance. SSEARCH with E-values can recognize >90% of the 
homologous pairs with 30-40% identity. In this region, there 
are 30 pairs of homologous proteins that do not have signif- 
icant E-values, but 26 of these involve sequences with <50 
residues. Of sequences having 25-30% identity, 75% are 
identified by SSEARCH E-values. However, although the num- 
ber of homologs grows at lower levels of identity, the detection 
falls off sharply: only 40% of homologs with 20-25% identity 



[Figure 6 plot: Distribution and Detection of Homologs (PDB40D-B). x axis: Percentage identity: in both; y axis: number of homologous pairs (0 to 2,500). Legend: Total number of homologs in database; Homologs detected by SSEARCH E-values at 1% EPQ.]



Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged ex-
tremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

are detected and only 10% of those with 15-20% can be found. 
These results show that statistical scores can find related 
proteins whose identity is remarkably low; however, the power 
of the method is restricted by the great divergence of many 
protein sequences. 

After completion of this work, a new version of pairwise 
BLAST was released: BLASTGP (37). It supports gapped align-
ments, like WU-BLAST2, and dispenses with sum statistics. Our 
initial tests on BLASTGP using default parameters show that its 
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped BLAST, but not 
quite equal to that of WU-BLAST2. 

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se-
quence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view. 

Our results also suggest two further points. First, the E-val-
ues reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true



Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                  Relative Time*   1% EPQ Cutoff       Coverage at 1% EPQ (%)
SSEARCH % identity: within alignment    25.5             >70%                <0.1
SSEARCH % identity: within both         25.5             34%                 3.0
SSEARCH % identity: HSSP-scaled         25.5             35% (HSSP + 9.8)    4.0
SSEARCH Smith-Waterman raw scores       25.5             142                 10.5
SSEARCH E-values                        25.5             0.03                18.4
FASTA ktup = 1 E-values                 3.9              0.03                17.9
FASTA ktup = 2 E-values                 1.4              0.03                16.7
WU-BLAST2 P-values                      1.1              0.003               17.5
BLAST P-values                          1.0              0.00016             14.8

*Times are from large database searches with genome proteins.






extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2 
detect most of the relationships found by the best procedures 
and are appropriate for rapid initial searches. 

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 



** Additional and updated information about this work, including 
supplementary figures, may be found at http://sss.stanford.edu/sss/.



The authors are grateful to Drs. A. G. Murzin, M. Levitt, S. R. Eddy, 
and G. Mitchison for valuable discussion. S.E.B. was principally 
supported by a St. John's College (Cambridge, UK) Benefactors' 
Scholarship and by the American Friends of Cambridge University. 
S.E.B. dedicates his contribution to the memory of Rabbi Albert T. 
and Clara S. Bilgray. 

1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman,
D. J. (1990) J. Mol. Biol. 215, 403-410.

2. Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480.

3. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA
85, 2444-2448.

4. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995)
J. Mol. Biol. 247, 536-540.

5. Brenner, S. E., Chothia, C., Hubbard, T. J. P. & Murzin, A. G.
(1996) Methods Enzymol. 266, 635-643.

6. Pearson, W. R. (1991) Genomics 11, 635-650.

7. Pearson, W. R. (1995) Protein Sci. 4, 1145-1160.

8. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.

9. George, D. G., Hunt, L. T. & Barker, W. C. (1996) Methods
Enzymol. 266, 41-59.

10. Vogt, G., Etzold, T. & Argos, P. (1995) J. Mol. Biol. 249, 816-831.

11. Henikoff, S. & Henikoff, J. G. (1993) Proteins 17, 49-61.

12. Bairoch, A. & Apweiler, R. (1996) Nucleic Acids Res. 24, 21-25.

13. Bairoch, A., Bucher, P. & Hofmann, K. (1996) Nucleic Acids Res.
24, 189-196.

14. Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA
89, 10915-10919.

15. Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of
Protein Sequence and Structure, ed. Dayhoff, M. (National Bio-
medical Research Foundation, Silver Spring, MD), Vol. 5, Suppl.
3, pp. 345-352.

16. Brenner, S. E. (1996) Ph.D. thesis (University of Cambridge,
UK).

17. Sander, C. & Schneider, R. (1991) Proteins 9, 56-68.

18. Johnson, M. S. & Overington, J. P. (1993) J. Mol. Biol. 233,
716-738.

19. Barton, G. J. & Sternberg, M. J. E. (1987) Protein Eng. 1, 89-94.

20. Lesk, A. M., Levitt, M. & Chothia, C. (1986) Protein Eng. 1,
77-78.

21. Arratia, R., Gordon, L. & Waterman, M. S. (1986) Ann. Stat. 14, 971-993.

22. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87,
2264-2268.

23. Karlin, S. & Altschul, S. F. (1993) Proc. Natl. Acad. Sci. USA 90,
5873-5877.

24. Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C. (1994)
Nat. Genet. 6, 119-129.

25. Pearson, W. R. (1996) Methods Enzymol. 266, 227-258.

26. Lipman, D. J., Wilbur, W. J., Smith, T. F. & Waterman, M. S.
(1984) Nucleic Acids Res. 12, 215-226.

27. Wootton, J. C. & Federhen, S. (1996) Methods Enzymol. 266,
554-571.

28. Waterman, M. S. & Vingron, M. (1994) Stat. Science 9, 367-381.

29. Perutz, M. F., Kendrew, J. C. & Watson, H. C. (1965) J. Mol. Biol.
13, 669-678.

30. Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. &
Weng, J. (1987) in Crystallographic Databases: Information Con-
tent, Software Systems, Scientific Applications, eds. Allen, F. H.,
Bergerhoff, G. & Sievers, R. (Data Comm. Intl. Union Crystal-
logr., Cambridge, UK), pp. 107-132.

31. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997) Curr. Opin.
Struct. Biol. 7, 369-376.

32. Orengo, C., Michie, A., Jones, S., Jones, D. T., Swindells, M. B. &
Thornton, J. (1997) Structure (London) 5, 1093-1108.

33. Zweig, M. H. & Campbell, G. (1993) Clin. Chem. 39, 561-577.

34. Gribskov, M. & Robinson, N. L. (1996) Comput. Chem. 20, 25-33.

35. Fitch, W. M. (1966) J. Mol. Biol. 16, 9-16.

36. Chung, S. Y. & Subbiah, S. (1996) Structure (London) 4, 1123-
1127.

37. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang,
Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25,
3389-3402.

38. Girling, R., Schmidt, W., Jr., Houston, T., Amma, E. & Huisman,
T. (1979) J. Mol. Biol. 131, 417-433.

39. Spezio, M., Wilson, D. & Karplus, P. (1993) Biochemistry 32,
9906-9916.

40. Sayle, R. A. & Milner-White, E. J. (1995) Trends Biochem. Sci.
20, 374-376.



Reference 12 of 20 

with Response dated 05/04/04 

In USSN: 09/857,826 



XENOBIOTICA, 1999, VOL. 29, NO. 7, 655-691



Differential gene expression in drug metabolism and 
toxicology: practicalities, problems and potential 

JOHN C. ROCKETT†, DAVID J. ESDAILE‡
and G. GORDON GIBSON*

Molecular Toxicology Laboratory, School of Biological Sciences, University of Surrey, 
Guildford, Surrey, GU2 5XH, UK 

Received January 8, 1999 

1. An important feature of the work of many molecular biologists is identifying which
genes are switched on and off in a cell under different environmental conditions or
subsequent to xenobiotic challenge. Such information has many uses, including the
deciphering of molecular pathways and facilitating the development of new experimental
and diagnostic procedures. However, the student of gene hunting should be forgiven for
perhaps becoming confused by the mountain of information available as there appear to be
almost as many methods of discovering differentially expressed genes as there are research
groups using the technique.

2. The aim of this review was to clarify the main methods of differential gene expression
analysis and the mechanistic principles underlying them. Also included is a discussion on
some of the practical aspects of using this technique. Emphasis is placed on the so-called
'open' systems, which require no prior knowledge of the genes contained within the study
model. Whilst these will eventually be replaced by 'closed' systems in the study of human,
mouse and other commonly studied laboratory animals, they will remain a powerful tool for
those examining less fashionable models.

3. The use of suppression-PCR subtractive hybridization is exemplified in the
identification of up- and down-regulated genes in rat liver following exposure to pheno-
barbital, a well-known inducer of the drug metabolizing enzymes.

4. Differential gene display provides a coherent platform for building libraries and
microchip arrays of 'gene fingerprints' characteristic of known enzyme inducers and
xenobiotic toxicants, which may be interrogated subsequently for the identification and
characterization of xenobiotics of unknown biological properties.



Introduction 

It is now apparent that the development of almost all cancers and many non-
neoplastic diseases is accompanied by altered gene expression in the affected cells
compared to their normal state (Hunter 1991, Wynford-Thomas 1991, Vogelstein
and Kinzler 1993, Semenza 1994, Cassidy 1995, Kleinjan and Van Heyningen 1998).
Such changes also occur in response to external stimuli such as pathogenic micro-
organisms (Rohn et al. 1996, Singh et al. 1997, Griffin and Krishna 1998, Lunney
1998) and xenobiotics (Sewall et al. 1995, Dogra et al. 1998, Ramana and Kohli
1998), as well as during the development of undifferentiated cells (Hecht 1998,
Rudin and Thompson 1998, Schneider-Maunoury et al. 1998). The potential
medical and therapeutic benefits of understanding the molecular changes which
occur in any given cell in progressing from the normal to the 'altered' state are
enormous. Such profiling essentially provides a 'fingerprint' of each step of a



* Author for correspondence; e-mail: g.gibson@surrey.ac.uk

† Current Address: US Environmental Protection Agency, National Health and Environmental
Effects Research Laboratory, Reproductive Toxicology Division, Research Triangle Park, NC 27711,
USA.

‡ Rhone-Poulenc Agrochemicals, Toxicology Department, Sophia-Antipolis, Nice, France.

Xenobiotica ISSN 0049-8254 print/ISSN 1366-5928 online © 1999 Taylor & Francis Ltd
http://www.tandf.co.uk/JNLS/xen.htm http://www.taylorandfrancis.com/JNLS/xen.htm



656 J. C. Rockett et al.

cell's development or response and should help in the elucidation of specific and 
sensitive biomarkers representing, for example, different types of cancer or previous 
exposure to certain classes of chemicals that are enzyme inducers. 

In drug metabolism, many of the xenobiotic-metabolizing enzymes (including
the well-characterized isoforms of cytochrome P450) are inducible by drugs and
chemicals in man (Pelkonen et al. 1998), predominantly involving transcriptional
activation of not only the cognate cytochrome P450 genes, but additional cellular
proteins which may be crucial to the phenomenon of induction. Accordingly, the
development of methodology to identify and assess the full complement of genes
that are either up- or down-regulated by inducers is crucial in the development of
knowledge to understand the precise molecular mechanisms of enzyme induction
and how this relates to drug action. Similarly, in the field of chemical-induced 
toxicity, it is now becoming increasingly obvious that most adverse reactions to 
drugs and chemicals are the result of multiple gene regulation, some of which are 
causal and some of which are casually-related to the toxicological phenomenon per 
se. This observation has led to an upsurge in interest in gene-profiling technologies 
which differentiate between the control and toxin-treated gene pools in target tissues
and is, therefore, of value in rationalizing the molecular mechanisms of xenobiotic- 
induced toxicity. Knowledge of toxin-dependent gene regulation in target tissues is 
not solely an academic pursuit as much interest has been generated in the 
pharmaceutical industry to harness this technology in the early identification of toxic 
drug candidates, thereby shortening the developmental process and contributing 
substantially to the safety assessment of new drugs. For example, if the gene profile 
in response to say a testicular toxin that has been well-characterized in vivo could be 
determined in the testis, then this profile would be representative of all new drug 
candidates which act via this specific molecular mechanism of toxicity, thereby 
providing a useful and coherent approach to the early detection of such toxicants. 
Whereas it would be informative to know the identity and functionality of all genes
up/down regulated by such toxicants, this would appear a longer term goal, as the
majority of human genes have not yet been sequenced, far less their functionality
determined. However, the current use of gene profiling yields a pattern of gene
changes for a xenobiotic of unknown toxicity which may be matched to that of well-
characterized toxins, thus alerting the toxicologist to possible in vivo similarities
between the unknown and the standard, thereby providing a platform for more
extensive toxicological examination. Such approaches are beginning to gain
momentum, in that several biotechnology companies are commercially producing
'gene chips' or 'gene arrays' that may be interrogated for toxicity assessment of
xenobiotics. These chips consist of hundreds/thousands of genes, some of which are
degenerate in the sense that not all of the genes are mechanistically-related to any 
one toxicological phenomenon. Whereas these chips are useful in broad-spectrum 
screening, they are maturing at a substantial rate, in that gene arrays are now 
becoming more specific, e.g. chips for the identification of changes in growth factor 
families that contribute to the aetiology and development of chemically-induced 
neoplasias. 
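
The profile-matching idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the gene labels, the two reference profiles and the similarity score (the mean Jaccard overlap of the up- and down-regulated gene sets) are hypothetical and are not taken from any commercial array or published data set.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def profile_similarity(p, q):
    """Mean Jaccard overlap of the up- and down-regulated gene sets."""
    return 0.5 * (jaccard(p["up"], q["up"]) + jaccard(p["down"], q["down"]))


def best_match(unknown, references):
    """Return (name, score) of the reference profile closest to `unknown`."""
    scores = {name: profile_similarity(unknown, ref)
              for name, ref in references.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical reference profiles for two well-characterized toxins.
references = {
    "toxin_A": {"up": {"g1", "g2", "g3"}, "down": {"g7", "g8"}},
    "toxin_B": {"up": {"g4", "g5"}, "down": {"g1", "g9"}},
}
# Hypothetical profile of a xenobiotic of unknown toxicity.
unknown = {"up": {"g1", "g2", "g6"}, "down": {"g7"}}

match_name, match_score = best_match(unknown, references)
print(match_name, match_score)
```

A real comparison would need quantitative expression values and a statistical treatment of noise; the set overlap used here only shows the shape of the comparison.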

Although documenting and explaining these genetic changes presents a 
formidable obstacle to understanding the different mechanisms of development and 
disease progression, the technology is now available to begin attempting this difficult 
challenge. Indeed, several 'differential expression analysis' methods have been
developed which facilitate the identification of gene products that demonstrate 



Differential gene expression 657 

altered expression in cells of one population compared to another. These methods 
have been used to identify differential gene expression in many situations, including 
invading pathogenic microbes (Zhao et al. 1998), in cells responding to extracellular 
and intracellular microbial invasion (Duguid and Dinauer 1990, Ragno et al. 1997, 
Maldarelli et al. 1998), in chemically treated cells (Syed et al. 1997, Rockett et al. 
1999), neoplastic cells (Liang et al. 1992, Chang and Terzaghi-Howe 1998), 
activated cells (Gurskaya et al. 1996, Wan et al. 1996), differentiated cells (Hara et 
al. 1991, Guimaraes et al. 1995a, b), and different cell types (Davis et al. 1984, 
Hedrick et al. 1984, Xhu et al. 1998). Although differential expression analysis 
technologies are applicable to a broad range of models, perhaps their most important 
advantage is that, in most cases, absolutely no prior knowledge of the specific genes 
which are up- or down-regulated is required. 

The field of differential expression analysis is a large and complex one, with 
many techniques available to the potential user. These can be categorized into 
several methodological approaches, including: 

(1) Differential screening, 

(2) Subtractive hybridization (SH) (includes methods such as chemical cross- 
linking subtraction — CCLS, suppression-PCR subtractive hybridization — 
SSH, and representational difference analysis — RDA), 

(3) Differential display (DD), 

(4) Restriction endonuclease facilitated analysis (including serial analysis of gene 
expression — SAGE — and gene expression fingerprinting — GEF), 

(5) Gene expression arrays, and 

(6) Expressed sequence tag (EST) analysis. 

The above approaches have been used successfully to isolate differentially 
expressed genes in different model systems. However, each method has its own 
subtle (and sometimes not so subtle) characteristics which incur various advantages 
and disadvantages. Accordingly, it is the purpose of this review to clarify the 
mechanistic principles underlying the main differential expression methods and to 
highlight some of the broader considerations and implications of this very powerful 
and increasingly popular technique. Specifically, we will concentrate on the
so-called 'open' systems, namely those which do not require any knowledge of gene
sequences and, therefore, are useful for isolating unknown genes. Two 'closed'
systems (those utilising previously identified gene sequences), EST analysis and the
use of DNA arrays, will also be considered briefly for completeness. Whilst
emphasis will often be placed on suppression PCR subtractive hybridization (SSH,
the approach employed in this laboratory), it is the aim of the authors to highlight, 
wherever possible, those areas of common interest to those who use, or intend to use, 
differential gene expression analysis. 

Differential cDNA library screening (DS) 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis, 
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. One of the original 
approaches used to identify such genes was described 20 years ago by St John and 
Davis (1979). These authors developed a method, termed 'differential plaque filter



658 J, C, Rockett et al. 

hybridization', which was used to isolate galactose-inducible DNA sequences from
yeast. The theory is simple: a genomic DNA library is prepared from normal, 
unstimulated cells of the test organism/tissue and multiple filter replicas are 
prepared. These replica blots are probed with radioactively (or otherwise) labelled 
complex cDNA probes prepared from the control and test cell mRNA populations. 
Those mRNAs which are differentially expressed in the treated cell population will 
show a positive signal only on the filter probed with cDNA from the treated cells. 
Furthermore, labelled cDNA from different test conditions can be used to probe 
multiple blots, thereby enabling the identification of mRNAs which are only up- 
regulated under certain conditions. For example, St John and Davis (1979) screened 
replica filters with acetate-, glucose- and galactose-derived probes in order to obtain 
genes induced specifically by galactose metabolism. Although groundbreaking in its 
time this method is now considered insensitive and time-consuming, as up to 2 
months are required to complete the identification of genes which are differentially 
expressed in the test population. In addition, there is no convenient way to check 
that the procedure has worked until the whole process has been completed. 
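
The logic of reading replica filters can be sketched as simple set arithmetic. This is only an illustration: the clone identifiers and the signals assigned to each probe are hypothetical, and a clone is treated as simply positive or negative on each filter, which ignores signal intensity.

```python
def differential_plaques(signals, test, controls):
    """Clones giving a signal with the test probe but with none of the control probes."""
    positive = set(signals[test])
    for control in controls:
        positive -= set(signals[control])
    return sorted(positive)


# Hypothetical clone IDs that give a positive signal on each replica filter.
signals = {
    "galactose": {"c01", "c02", "c05", "c09"},
    "glucose": {"c01", "c05"},
    "acetate": {"c01", "c07"},
}

galactose_specific = differential_plaques(signals, "galactose", ["glucose", "acetate"])
print(galactose_specific)
```

Clones positive on every filter (here `c01`) correspond to sequences expressed under all conditions and are discarded.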

Subtractive Hybridization (SH) 

The developing concept of differential gene expression and the success of early 
approaches such as that described by St John and Davis (1979) soon gave rise to a 
search for more convenient methods of analysis. One of the first to be developed was 
SH, numerous variations of which have since been reported (see below). In general, 
this approach involves hybridization of mRNA/cDNA from one population (tester)
to excess mRNA/cDNA from another (driver), followed by separation of the 
unhybridized tester fraction (differentially expressed) from the hybridized common 
sequences. This step has been achieved physically, chemically and through the use 
of selective polymerase chain reaction (PCR) techniques. 

Physical separation 

Original subtractive hybridization technology involved the physical separation 
of hybridized common species from unique single stranded species. Several methods 
of achieving this have been described, including hydroxyapatite chromatography 
(Sargent and Dawid 1983), avidin-biotin technology (Duguid and Dinauer 1990) 
and oligodT-latex separation (Hara et al. 1991). In the first approach, common 
mRNA species are removed by cDNA (from test cells)-mRNA (from control cells) 
subtractive hybridization followed by hydroxyapatite chromatography, as hydroxy- 
apatite specifically adsorbs the cDNA-mRNA hybrids. The unabsorbed cDNA is 
then used either for the construction of a cDNA library of differentially expressed 
genes (Sargent and Dawid 1983, Schneider et al. 1988) or directly as a probe to 
screen a preselected library (Zimmerman et al. 1980, Davis et al. 1984, Hedrick et al. 
1984). A schematic diagram of the procedure is shown in figure 1. 

Less rigorous physical separation procedures coupled with sensitivity enhancing 
PCR steps were later developed as a means to overcome some of the problems 
encountered with the hydroxyapatite procedure. For example, Duguid and Dinauer
(1990) described a method of subtraction utilizing biotin-affinity systems as a means
to remove hybridized common sequences. In this process, both the control and
tester mRNA populations are first converted to cDNA and an adaptor ('oligovector',






[Figure 1 schematic. Flow: control (driver) mRNA and tester (test) cDNA (1st strand) -> mix (ratio >35:1) and hybridize -> hydroxyapatite chromatography (RNA:cDNA hybrids removed) -> unhybridized cDNA (differentially expressed) and mRNA -> Sepharose CL6B exclusion chromatography (small cDNA fragments, <450 bp, removed) -> enriched, differentially expressed cDNA -> produce clones, or label directly and probe library.]

Figure 1. The hydroxyapatite method of subtractive hybridization. cDNA derived from the 
treated/altered (tester) population is mixed with a large excess of mRNA from the control (driver)
population. Following hybridization, mRNA-cDNA hybrids are removed by hydroxyapatite
chromatography. The only cDNAs which remain are those which are differentially expressed in
the treated/altered population. In order to facilitate the recovery of full length clones, small cDNA
fragments are removed by exclusion chromatography. The remaining cDNAs are then cloned into
a vector for sequencing, or labelled and used directly to probe a library, as described by Sargent 
and Dawid (1983). 



containing a restriction site) ligated to both sides. Both populations are then
amplified by PCR, but the driver cDNA population is subsequently digested with
the restriction endonuclease whose recognition site lies in the adaptor. This serves
to cleave the oligovector and reduce the amplification potential of the control population. The digested
control population is then biotinylated and an excess mixed with tester cDNA.
Following denaturation and hybridization, the mix is applied to a biocytin column 
(streptavidin may also be used) to remove the control population, including 
heteroduplexes formed by annealing of common sequences from the tester 
population. The procedure is repeated several times following the addition of fresh 






[Figure 2 schematic. Flow: control (driver) mRNA and test (tester) mRNA -> anneal control mRNA to polydT30 latex beads -> cDNA synthesis -> mix and anneal with tester mRNA -> centrifuge beads, collect and store supernatant, dissociate polyA, reapply supernatant -> tester-specific mRNA retrieved after 4 rounds of hybridization -> cDNA synthesis -> ligate adaptors and insert into vector -> sequence inserts and/or carry out other downstream applications.]

Figure 2. The use of oligodT30 latex to perform subtractive hybridization. mRNA extracted from the
control (driver) population is converted to anchored cDNA using polydT oligonucleotides 
attached to latex beads. mRNA from the treated/altered (tester) population is repeatedly 
hybridized against an excess of the anchored driver cDNA. The final population of mRNA is 
tester specific and can be converted into cDNA for cloning and other downstream applications, as 
described by Hara et al. (1991). 






control cDNA. In order to further enrich those species differentially expressed in 
the tester cDNA, the subtracted tester population is amplified by PCR following
every second subtraction cycle. After six cycles of subtraction (three reamplification 
steps) the reaction mix is ligated into a vector for further analysis. 

In a slightly different approach, Hara et al. (1991) utilized a method whereby
oligo(dT30) primers attached to a latex substrate are used to first capture mRNA
extracted from the control population. Following 1st strand cDNA synthesis, the
RNA strand of the heteroduplexes is removed by heat denaturation and centrifugation
(the cDNA-oligotex-dT30 forms a pellet and the supernatant is removed).
A quantity of tester mRNA is then repeatedly hybridized to the immobilized control
(driver) cDNA (which is present in 20-fold excess). After several rounds of
hybridization the only mRNA molecules left in the tester mRNA population are
those which are not found in the driver cDNA-oligotex-dT30 population. These
tester-specific mRNA species are then converted to cDNA and, following the
addition of adaptor sequences, amplified by PCR. The PCR products are then
ligated into a vector for further analysis using restriction sites incorporated into the
PCR primers. A schematic illustration of this subtraction process is shown in figure
2. 

However, all these methods utilising physical separation have been described as 
inefficient due to the requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of 
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems. 

Chemical Cross-Linking Subtraction (CCLS)

In this technique, originally described by Hampson et al. (1992), driver mRNA 
is mixed with tester cDNA (1st strand only) in a ratio of > 20:1. The common 
sequences form cDNA:mRNA hybrids, leaving the tester specific species as single 
stranded cDNA. Instead of physically separating these hybrids, they are inactivated 
chemically using 2,5-diaziridinyl-1,4-benzoquinone (DZQ). Labelled probes are
then synthesized from the remaining single stranded cDNA species (unreacted
mRNA species remaining from the driver are not converted into probe material due
to the specificity of the Sequenase T7 DNA polymerase used to make the probe) and used
to screen a cDNA library made from the tester cell population. A schematic diagram 
of the system is shown in figure 3. 

It has been shown that the differentially expressed sequences can be enriched at 
least 300-fold with one round of subtraction (Hampson et al. 1992), and that the
technique should allow isolation of cDNAs derived from transcripts that are present
at less than 50 copies per cell. This equates to genes at the low end of intermediate
abundance (see table 1). The main advantages of the CCLS approach are that it is
rapid, technically simple and also produces fewer false positives than other
differential expression analysis methods. However, like the physical separation
protocols, a major drawback with CCLS is the large amount of starting material
required (at least 10 µg RNA). Consequently, the technique has recently been
refined so that a renewable source of RNA can be generated. The degenerate random
oligonucleotide primed (DROP) adaptation (Hampson et al. 1996, Hampson and
Hampson 1997) uses random hexanucleotide sequences to prime solid
phase-synthesized cDNA. Since each primer includes a T7 polymerase promoter sequence






[Figure 3 schematic. Flow: control (driver) mRNA and test (tester) mRNA -> 1st strand cDNA synthesis from tester mRNA, followed by alkaline hydrolysis -> mix and anneal -> mRNA:cDNA hybrids plus unique cDNA species -> cross-linking agent (DZQ) added -> hybrids are cross-linked -> probes synthesized from single stranded cDNA species and used to probe cDNA library.]

Figure 3. Chemical cross-linking subtraction. Excess driver mRNA is mixed with 1st strand tester
cDNA. The common sequences form mRNA:cDNA hybrids which are cross linked with
2,5-diaziridinyl-1,4-benzoquinone (DZQ) and the remaining cDNA sequences are differentially
expressed in the tester population. Probes are made from these sequences using Sequenase 2.0 
DNA polymerase, which lacks reverse transcriptase activity and, therefore, does not react with the 
remaining mRNA molecules from the driver. The labelled probes are then used to screen a cDNA 
library for clones of differentially expressed sequences. Adapted from Walter et al. (1996), with 
permission. 



Table 1. The abundance of mRNA species and classes in a typical mammalian cell.

mRNA class     Copies of each   No. of mRNA        Mean % of each     Mean mass (ng) of each
               species/cell     species in class   species in class   species/µg total RNA

Abundant       12000            4                  3.3                1.65
Intermediate   300              500                0.08               0.04
Rare           15               11000              0.004              0.002



Modified from Bertioli et al. (1995). 
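
The last two columns of table 1 can be reproduced from the first two with a short calculation, sketched below. The copy numbers and species counts are those in the table. The figure of 5% for the proportion of total RNA that is mRNA is an assumption: the table does not state it, and it is used here only because it reproduces the tabulated masses.

```python
# (class name, copies of each species per cell, number of species in class), from table 1.
CLASSES = [
    ("Abundant", 12000, 4),
    ("Intermediate", 300, 500),
    ("Rare", 15, 11000),
]

# Assumption (not stated in the table): mRNA makes up 5% of total RNA by mass.
MRNA_FRACTION_OF_TOTAL_RNA = 0.05


def abundance_table(classes, mrna_fraction=MRNA_FRACTION_OF_TOTAL_RNA):
    """Return total mRNA copies per cell and, per class, (name, % per species, ng per ug total RNA)."""
    total_copies = sum(copies * n_species for _, copies, n_species in classes)
    rows = []
    for name, copies, _ in classes:
        percent = 100.0 * copies / total_copies
        # 1 ug total RNA = 1000 ng, of which mrna_fraction is mRNA.
        ng_per_ug = (percent / 100.0) * mrna_fraction * 1000.0
        rows.append((name, percent, ng_per_ug))
    return total_copies, rows


total_copies, rows = abundance_table(CLASSES)
for name, percent, ng in rows:
    print(f"{name:<13}{percent:9.4f} %{ng:9.4f} ng")
```

On these figures a single rare transcript accounts for only a few picograms per microgram of total RNA, which is why methods needing large amounts of starting material struggle to recover such species.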






at the 5' end, the final pool of random cDNA fragments is a PCR-renewable cDNA
population which is representative of the expressed gene pool and can be used to 
synthesize sense RNA for use as driver material. Furthermore, if the final pool of 
random cDNA fragments is reamplified using biotinylated T7 primer and random 
hexamer, the product can be captured with streptavidin beads and the antisense 
strand eluted for use as tester. Since both target and driver can be generated from 
the same DROP product, subtraction can be performed in both directions (i.e. for 
up- and down-regulated species) between two different DROP products. 

Representational Difference Analysis (RDA)

RDA of cDNA (Hubank and Schatz 1994) is an extension of the technique 
originally applied to genomic DNA as a means of identifying differences between 
two complex genomes (Lisitsyn et al. 1993). It is a process of subtraction and 
amplification involving subtractive hybridization of the tester in the presence of 
excess driver. Sequences in the tester that have homologues in the driver are 
rendered unamplifiable, whereas those genes expressed only in the tester retain the 
ability to be amplified by PCR. The procedure is shown schematically in figure 4.

In essence, the driver and tester mRNA populations are first converted to cDNA
and amplified by PCR following the ligation of an adaptor. The adaptors are then
removed from both populations and a new (different) adaptor ligated to the
amplified tester population only. Driver and tester populations are next melted and
hybridized together in a ratio of 100:1 (driver:tester). Following hybridization, only tester:tester
homohybrids have 5' adaptors at each end of the DNA duplex and can, thus, be filled
in at both 3' ends. Hence, only these molecules are amplified exponentially during
the subsequent PCR step. Although tester:driver heterohybrids are present, they
only amplify in a linear fashion, since the strand derived from the driver has no
adaptor to which the primer can bind. Driver:driver homohybrids have no
adaptors and, therefore, are not amplified. Single stranded molecules are digested
with mung bean nuclease before a further PCR-enrichment of the tester:tester
homohybrids. The adaptors on the amplified tester population are then replaced and
the whole process repeated a further two or three times using an increasing excess of
driver (Hubank and Schatz used a tester:driver ratio of 1:400, 1:80000 and
1:800000 for the second, third and fourth hybridizations, respectively). Different
adaptors are ligated to the tester between successive rounds of hybridization and
amplification to prevent the accumulation of PCR products that might interfere with
subsequent amplifications. The final display is a series of differentially expressed 
gene products easily observable on an ethidium bromide gel. 
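
The selective amplification that underlies RDA can be sketched as follows. This is an idealized illustration, not a model of a real reaction: it assumes 100% efficiency per cycle, one new strand per cycle for a template carrying an adaptor on one strand only, and a hypothetical starting amount of one duplex.

```python
def amplification_mode(strand_a, strand_b):
    """Classify a duplex by the origin of its two strands ('tester' or 'driver')."""
    origins = {strand_a, strand_b}
    if not origins <= {"tester", "driver"}:
        raise ValueError("strand origin must be 'tester' or 'driver'")
    if origins == {"tester"}:
        return "exponential"  # adaptor on both strands
    if origins == {"tester", "driver"}:
        return "linear"       # adaptor on the tester strand only
    return "none"             # driver:driver, no adaptor


def copies_after(mode, cycles, start=1):
    """Idealized copy number after `cycles` rounds of PCR."""
    if mode == "exponential":
        return start * 2 ** cycles
    if mode == "linear":
        return start * (1 + cycles)
    return start


for pair in [("tester", "tester"), ("tester", "driver"), ("driver", "driver")]:
    mode = amplification_mode(*pair)
    print(pair, mode, copies_after(mode, 20))
```

Under these assumptions, 20 cycles separate a tester:tester duplex from a tester:driver duplex by a factor of roughly 50,000, which is why the common sequences effectively disappear from the final display.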

The main advantages of RDA are that it offers a reproducible and sensitive 
approach to the analysis of differentially expressed genes. Hubank and Schatz (1994)
reported that they were able to isolate genes that were differentially expressed in
substantially less than 1% of the cells from which the tester is derived. Perhaps the
main drawback is that multiple rounds of ligation, hybridization, amplification and
digestion are required. The procedure is, therefore, lengthier than many other
differential display approaches and provides more opportunity for operator-induced
error to occur. Although the generation of false positives has been noted, this has
been solved to some degree by O'Neill and Sinclair (1997) through the use of
HPLC-purified adaptors. These are free of the truncated adaptors which appear to be a
major source of the false positive bands. A very similar technique to RDA, termed
linker capture subtraction (LCS), was described by Yang and Sytowski (1996).






[Figure 4 schematic. Flow: ds control (driver) cDNA and ds test (tester) cDNA -> digest with restriction enzyme -> ligate to dephosphorylated 12/24 adaptor strands -> melt 12mer -> fill in 3' ends (Taq), add primer and amplify -> driver: digest; tester: digest and ligate new 12/24 adaptor -> mix 100:1, melt and hybridize -> fill in ends, add primer and amplify (linear amplification, exponential amplification or no amplification, depending on the hybrid) -> digest PCR products with mung bean nuclease to remove ssDNA molecules present after amplification -> first difference product.]

Figure 4. The representational difference analysis (RDA) technique. Driver and tester cDNA are 
digested with a 4-cutter restriction enzyme such as DpnII. The 1st set of 12/24 adaptor strands
(oligonucleotides) are ligated to each other and the digested cDNA products. The 12mer is
subsequently melted away and the 3' ends filled in using Taq DNA polymerase. Each cDNA
population is then amplified using PCR, following which the 1st set of adaptors is removed with
DpnII. A second set of 12/24 adaptor strands is then added to the amplified tester cDNA
population, after which the tester is hybridized against a large excess of driver. The 12mer
adaptors are melted and the 3' ends filled in as before. PCR is carried out with primers identical
to the new 24mer adaptor. Thus, the only hybridization products which are exponentially
amplified are those which are tester:tester combinations. Following PCR, ssDNA products are
removed with mung bean nuclease, leaving the 'first difference product'. This is digested and a
third set of 12/24 adaptors added before repeating the subtraction process from the hybridization
stage. The process is repeated to the 3rd or 4th difference product, as described by Lisitsyn et al.
(1993) and Hubank and Schatz (1994). 






Suppression PCR Subtractive Hybridization (SSH)

The most recent adaptation of the SH approach to differential expression 
analysis was first described by Diatchenko et al. (1996) and Gurskaya et al. (1996). 
They reported that a 1000-5000 fold enrichment of rare cDNAs (equivalent to 
isolating mRNAs present at only a few copies per cell) can be obtained without the 
need for multiple hybridizations/subtractions. Instead of physical or chemical 
removal of the common sequences, a PCR-based suppression system is used (see 
figure 5). 

In SSH, excess driver cDNA is added to two portions of the tester cDNA which 
have been ligated with different adaptors. A first round of hybridization serves to 
enrich differentially expressed genes and equalize rare and abundant messages. 
Equalization occurs since reannealing is more rapid for abundant molecules than for 
rarer molecules due to the second order kinetics of hybridization (James and Higgins
1985). The two primary hybridization mixes are then mixed together in the presence
of excess driver and allowed to hybridize further. This step permits the annealing of
single stranded complementary sequences which did not hybridize in the primary
hybridization, and in doing so generates templates for PCR amplification. Although
there are several possible combinations of the single stranded molecules present in
the secondary hybridization mix, only one particular combination (differentially
expressed in the tester cDNA composed of complementary strands having different
adaptors) can amplify exponentially. 
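
The equalization effect attributed above to second order kinetics can be illustrated numerically. The sketch assumes an idealized rate law, dC/dt = -k*C**2, for the single stranded fraction of each species, with the same rate constant for every species and no driver present. The rate constant, time and starting concentrations are arbitrary illustrative values, not measurements.

```python
def single_stranded(c0, k, t):
    """Concentration still single stranded after time t, solving dC/dt = -k * C**2."""
    return c0 / (1.0 + k * c0 * t)


# Arbitrary units, chosen only to illustrate the trend.
k, t = 1.0, 10.0
abundant0, rare0 = 1000.0, 1.0

abundant_t = single_stranded(abundant0, k, t)
rare_t = single_stranded(rare0, k, t)

ratio_before = abundant0 / rare0
ratio_after = abundant_t / rare_t
print(f"abundant:rare before = {ratio_before:.1f}, after = {ratio_after:.2f}")
```

Because the single stranded concentration of every species tends towards 1/(k*t) regardless of its starting value, a 1000-fold difference in abundance shrinks to a near-equal ratio in this idealized case. As noted in the text, equalization in practice is far from perfect.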

Having obtained the final differential display, two options are available if cloning 
of cDNAs is desired. One is to transform the whole of the final PCR reaction into 
competent cells. Transformed colonies can then be isolated and their inserts 
characterized by sequencing, restriction analysis or PCR. Alternatively, the final 
PCR products can be resolved on a gel and the individual bands excised, reamplified 
and cloned. The first approach is technically simpler and less time consuming. 
However, ligation/transformation reactions are known to be biased towards the 
cloning of smaller molecules, and so the final population of clones will probably not 
contain a representative selection of the larger products. In addition, although 
equalization theoretically occurs, observations in this laboratory suggest that this is 
by no means perfectly accomplished. Consequently, some gene species are present 
in a higher number than others and this will be represented in the final population 
of clones. Thus, in order to obtain a substantial proportion of those gene species that 
actually demonstrate differential expression in the tester population, the number of 
clones that will have to be screened after this step may be substantial. The second 
approach is initially more time consuming and technically demanding. However, it 
would appear to offer better prospects for cloning larger and low abundance gel 
products. In addition, one can incorporate a screening step that differentiates 
different products of different sequences but of the same size (HA-staining, see 
later). In this way, a good idea of the final number of clones to be isolated and 
identified can be achieved. 

An alternative (or even complementary) approach is to use the final differential 
display reaction to screen a cDNA library to isolate full length clones for further 
characterization, or a DNA array (see later) to quickly identify known genes. SSH 
has been used in this laboratory to begin characterization of the short-term gene 
expression profiles of enzyme-inducers such as phenobarbital (Rockett et al. 1997)
and Wy-14,643 (Rockett et al., unpublished observations). The isolation of
differentially expressed genes in this manner enables the construction of a fingerprint 






[Figure 5 schematic. Flow: tester cDNA with adaptor 1 and tester cDNA with adaptor 2, each with driver cDNA (in excess) -> first hybridization -> mix samples, add fresh denatured driver, anneal -> molecule types a, b, c, d and e -> fill in ends -> add primers and amplify by PCR. Outcome: a, d no amplification; b no amplification (suppressed due to formation of panhandle structure); c linear amplification; e exponential amplification.]

Figure 5. PCR-select cDNA subtraction. In the primary hybridization, an excess of driver cDNA is 
added to each tester cDNA population. The samples are heat denatured and allowed to hybridize 
for between 3 and 8 h. This serves two purposes : (1) to equalize rare and abundant molecules ; and 
(2) to enrich for differentially expressed sequences — cDNAs that are not differentially expressed 
form type c molecules with the driver. In the secondary hybridization, the two primary 
hybridizations are mixed together without denaturing. Fresh denatured driver can also be added 
at this point to allow further enrichment of differentially expressed sequences. Type e molecules 
are formed in this secondary hybridization which are subsequently amplified using two rounds of 
PCR. The final products can be visualized on an agarose gel, labelled directly or cloned into a
vector for downstream manipulation. As described by Diatchenko et al. (1996) and Gurskaya 
et al. (1996), with permission. 






[Figure 6 flow diagram. Recoverable box labels, in approximate order: control animals and treated animals -> extract mRNA from tissue of interest (e.g. liver) -> DNase treatment -> convert to cDNA (also used as complex probe for screening clones) -> hybridization, subtraction and amplification (control driving tester for up-regulated genes; tester driving control for down-regulated genes) -> run out products on agarose gel -> extract individual bands and clone in T/A vector -> PCR of 5-10 clone cultures per extracted band -> screen using standard and HA agarose -> different clones blotted and screened with up-regulated or down-regulated genes -> differentially expressed clones selected -> plasmid mini-preps of selected clones -> sequencing and identification.]



Figure 6. Flow diagram showing method used in this laboratory to isolate and identify clones of genes 
which are differentially expressed in rat liver following short term exposure to the enzyme 
inducers, phenobarbital and Wy-14,643. 



of expressed genes which are unique to each compound and time/dose point. Such 
information could be useful in short-term characterization of the toxic potential of 
new compounds by comparing the gene-expression profiles they elicit with those 
produced by known inducers. Figure 6 shows a flow diagram of the method used to 
isolate, verify and clone differentially expressed genes, and figure 7 shows expression 
profiles obtained from a typical SSH experiment. Subsequent sub-cloning of the 
individual bands, sequencing and gene data base interrogation reveals many genes 
which are either up- or down-regulated by phenobarbital in the rat (tables 2 and 3). 

One of the advantages in using the SSH approach is that no prior knowledge is 
required of which specific genes are up/down-regulated subsequent to xenobiotic 







Figure 7. SSH display patterns obtained from rat liver following 3-day treatment with Wy-14,643 or
phenobarbital. mRNA extracted from control and treated livers was used to generate the
differential displays using the PCR-Select cDNA subtraction kit (Clontech). Lane 1: 1 kb
ladder; lane 2: genes upregulated following Wy-14,643 treatment; lane 3: genes downregulated following
Wy-14,643 treatment; lane 4: genes upregulated following phenobarbital treatment; lane 5: genes
downregulated following phenobarbital treatment; lane 6: 1 kb ladder. Reproduced from Rockett et
al. (1997), with permission. 

exposure, and an almost complete complement of genes is obtained. For example,
the peroxisome proliferator and non-genotoxic hepatocarcinogen Wy-14,643
up-regulates at least 28 genes and down-regulates at least 15 in the rat (a sensitive
species) and produces 48 up- and 37 down-regulated genes in the guinea pig, a 
resistant species (Rockett, Swales, Esda and Gibson, unpublished observations). 
One of these genes, CD81, was up-regulated in the rat and down-regulated in the 
guinea pig following Wy-14,643 treatment. CD81 (alternatively named TAPA-1) is 
a widely expressed cell surface protein which is involved in a large number of cellular 
processes including adhesion, activation, proliferation and differentiation (Levy et 
al. 1998). Since all of these functions are altered to some extent in the phenomena 
of hepatomegaly and non-genotoxic hepatocarcinogenesis, it is intriguing, and 
probably mechanistically-relevant, that CD81 expression is differentially regulated 
in a resistant and susceptible species. However, the down-side of this approach is 
that the majority of genes can be sequenced and matched to database sequences, but 
the latter are predominantly expressed sequence tags or genes of completely 
unknown function, thus partially obscuring a realistic overall assessment of the 
critical genes of genuine biological interest. Notwithstanding the lack of complete 
functional identification of altered gene expression, such gene profiling studies
essentially provide a 'molecular fingerprint' in response to xenobiotic challenge,
thereby serving as a mechanistically-relevant platform for further detailed 
investigations. 
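
The species comparison described above, in which CD81 is up-regulated in the rat but down-regulated in the guinea pig, amounts to intersecting gene sets. The sketch below shows one way this could be done. Only CD81 and its direction of change come from the text; `geneA`, `geneB` and `geneC` are hypothetical placeholders.

```python
def opposite_regulation(species_a, species_b):
    """Genes regulated in opposite directions in two species.

    Returns (up in a and down in b, down in a and up in b), each sorted.
    """
    up_then_down = sorted(species_a["up"] & species_b["down"])
    down_then_up = sorted(species_a["down"] & species_b["up"])
    return up_then_down, down_then_up


# CD81 is from the text; the other gene labels are hypothetical placeholders.
rat = {"up": {"CD81", "geneA"}, "down": {"geneB"}}
guinea_pig = {"up": {"geneB", "geneC"}, "down": {"CD81"}}

up_rat_down_gp, down_rat_up_gp = opposite_regulation(rat, guinea_pig)
print(up_rat_down_gp, down_rat_up_gp)
```

Genes that fall into either list are natural candidates for follow-up when one species is sensitive and the other resistant.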

Differential Display (DD) 

Originally described as 'RNA fingerprinting by arbitrarily primed PCR' (Liang
and Pardee 1992) this method is now more commonly referred to as 'differential



Table 2. Genes up-regulated in rat liver following 3-day exposure to phenobarbital.

Band number                Highest sequence
(approximate size in bp)   similarity          FASTA-EMBL gene identification

5 (size illegible)         93.5%               CYP2B1
7 (1000)                   95.1%               Preproalbumin
                                               Serum albumin mRNA
8 (950)                    98.3%               NCI-CGAP-Pr1 H. sapiens (EST)
10 (850)                   95.7%               CYP2B1
11 (800)                   Clone 1  94.9%      CYP2B1
                           Clone 2  75.3%      CYP2B2
12 (750)                   93.8%               TRPM-2 mRNA
                                               Sulfated glycoprotein
15 (600)                   92.9%               Preproalbumin
                                               Serum albumin mRNA
16 (55)                    Clone 1  95.2%      CYP2B1
                           Clone 2  93.6%      Haptoglobulin mRNA partial alpha
21 (350)                   99.3%               18S, 5.8S & 28S rRNA

EST = Expressed sequence tag. Bands 1-4, 6, 9, 13, 14 and 17-20 were shown to be false positives by
dot blot analysis and, therefore, were not sequenced. Derived from Rockett et al. (1997). It should be
noted that the above genes do not represent the complete spectrum of genes which are up-regulated in
rat liver by phenobarbital, but simply represent the genes sequenced and identified to date.


Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital.

Band number                Highest sequence
(approximate size in bp)   similarity          FASTA-EMBL gene identification

1 (1500)                   95.3%               3-oxoacyl-CoA thiolase
2 (1200)                   92.3%               Hemopexin mRNA
3 (1000)                   91.7%               Alpha-2u-globulin mRNA
7 (700)                    Clone 1  77.2%      M. musculus C1 inhibitor
                           Clone 2  94.5%      Electron transfer flavoprotein
                           Clone 3  91.0%      M. musculus Topoisomerase 1 (Topo 1)
8 (650)                    Clone 1  86.9%      Soares 2NbMT M. musculus (EST)
                           Clone 2  96.2%      Alpha-2u-globulin (s-type) mRNA
9 (600)                    Clone 1  86.9%      Soares mouse NML M. musculus (EST)
                           Clone 2  82.0%      Soares p3NMF19.5 M. musculus (EST)
10 (550)                   73.8%               Soares mouse NML M. musculus (EST)
11 (525)                   95.7%               NCI-CGAP-Pr1 H. sapiens (EST)
12 (375)                   100.0%              Ribosomal protein
13 (23)                    Clone 1  97.2%      Soares mouse embryo NbME135 (EST)
                           Clone 2  100.0%     Fibrinogen B-beta-chain
                           Clone 3  100.0%     Apolipoprotein E gene
14 (170)                   96.0%               Soares p3NMF19.5 M. musculus (EST)
15 (140)                   97.3%               Stratagene mouse testis (EST)
Others: (300)              96.7%               R. norvegicus RASP 1 mRNA
        (275)              93.1%               Soares mouse mammary gland (EST)

EST = Expressed sequence tag. Bands 4-6 were shown to be false positives by dot blot analysis and,
therefore, were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes
do not represent the complete spectrum of genes which are down-regulated in rat liver by phenobarbital,
but simply represent the genes sequenced and identified to date.


display' (DD). In this method, all the mRNA species in the control and treated cell
populations are amplified in separate reactions using reverse transcriptase-PCR
(RT-PCR). The products are then run side-by-side on sequencing gels. Those
bands which are present in one display only, or which are much more intense in one
display compared to the other, are differentially expressed and may be recovered for
further characterization. One advantage of this system is the speed with which it can
be carried out: 2 days to obtain a display and as little as a week to make and identify
clones.

Two commonly used variations are based on different methods of priming the
reverse transcription step (figure 8). One is to use an oligo dT with a 2-base 'anchor'
at the 3'-end, e.g. 5' (dT11)CA 3' (Liang and Pardee 1992). Alternatively, an
arbitrary primer may be used for 1st strand cDNA synthesis (Welsh et al. 1992).
This variant of RNA fingerprinting has also been called 'RAP' (RNA Arbitrarily
Primed)-PCR. One advantage of this second approach is that PCR products may be
derived from anywhere in the RNA, including open reading frames. In addition, it
can be used for mRNAs that are not polyadenylated, such as many bacterial mRNAs
(Wong and McClelland 1994). In both cases, following reverse transcription and
denaturation, second strand cDNA synthesis is carried out with an arbitrary primer
(arbitrary primers have a single base at each position, as compared to random
primers, which contain a mixture of all four bases at each position). The resulting
PCR thus produces a series of products which, depending on the system (primer
length and composition, polymerase and gel system), usually includes 50-100
products per primer set (Band and Sager 1989). When a combination of different
dT-anchors and arbitrary primers is used, almost all mRNA species from a cell can
be amplified. When the cDNA products from two different populations are analysed
side by side on a polyacrylamide gel, differences in expression can be identified and
the appropriate bands recovered for cloning and further analysis.
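
The priming logic described above can be illustrated with a small in-silico sketch. It is only a toy model under stated assumptions: priming is treated as exact-match (real arbitrary primers tolerate mismatches), the poly(A) tail is assumed long enough for the dT stretch, and the function names and the transcript are invented for illustration rather than taken from any published protocol.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def anchored_primer_binds(mrna, anchor):
    """True if an oligo(dT)+anchor primer would prime this transcript.

    `mrna` is written 5'->3' in DNA letters and ends in a poly(A) tail.
    `anchor` is the 2-base 3' end of the primer, e.g. "CA" in (dT11)CA.
    The anchor must pair with the two bases just upstream of the tail.
    """
    body = mrna.rstrip("A")
    if len(body) == len(mrna) or len(body) < 2:
        return False  # no poly(A) tail, or nothing upstream of it
    return body[-2:] == revcomp(anchor)


def dd_product_sizes(mrna, anchor, arbitrary, dt_len=11):
    """Predicted product sizes (bp) for one transcript and one primer pair.

    The arbitrary primer anneals to the 1st strand cDNA, so its sequence
    appears in the mRNA sense strand; each exact match gives one product
    running from that site to the end of the anchored primer.
    """
    if not anchored_primer_binds(mrna, anchor):
        return []
    body = mrna.rstrip("A")
    sizes = []
    start = body.find(arbitrary)
    while start != -1:
        sizes.append(len(body) - start + dt_len)
        start = body.find(arbitrary, start + 1)
    return sizes


# Invented transcript: 24 nt body ending in TG, followed by a poly(A) tail.
toy_mrna = "GGCATCGATTACGGATCGATCCTG" + "A" * 20
print(dd_product_sizes(toy_mrna, "CA", "ATCGAT"))  # [32, 21]
print(dd_product_sizes(toy_mrna, "GA", "ATCGAT"))  # [] (anchor mismatch)
```

In this toy model one anchored primer and one arbitrary primer yield two bands from a single transcript, which mirrors the point made in figure 8 that several 2nd strand products may arise from one binding point of the 1st strand primer.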

Although DD is perhaps the most popular approach used today for identifying 
differentially expressed genes, it does suffer from several perceived disadvantages: 

(1) It may have a strong bias towards high copy number mRNAs (Bertioli et al.
1995), although this has been disputed (Wan et al. 1996) and the isolation of very
low abundance genes may be achieved in certain circumstances (Guimeraes et
al. 1995a).

(2) The cDNAs obtained often only represent the extreme 3' end of the mRNA
(often the 3'-untranslated region), although this may not always be the case
(Guimeraes et al. 1995a). Since the 3' end is often not included in GenBank and
shows variation between organisms, cDNAs identified by DD cannot always be
matched with their genes, even if those genes have already been identified.

(3) The pattern of differential expression seen on the display often cannot be
reproduced on Northern blots, with false positives arising in up to 70% of cases
(Sun et al. 1994). Some adaptations have been shown to reduce false positives,
including the use of two reverse transcriptases (Sung and Denman 1997),
comparison of uninduced and induced cells over a time course (Burn et al. 1994)
and comparison of DD-PCR products from two uninduced and two induced
lines (Sompayrac et al. 1995). The latter authors also reported that the use of
cytoplasmic RNA rather than total RNA reduces false positives arising from
nuclear RNA that is not transported to the cytoplasm.

Further details of the background, strengths and weaknesses of the DD
technique can be obtained from a review by McClelland et al. (1996) and from
articles by Liang et al. (1995) and Wan et al. (1996).



[Figure 8 schematic: an mRNA is primed for 1st strand cDNA synthesis either with a (dT11)CA
anchored primer at the poly(A) tail or with an arbitrary primer; after denaturation, the 2nd strand
is synthesised with any arbitrary primer; the cDNA can now be amplified by PCR using the original
primer pair.]

Figure 8. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out
either with a polydT11NN primer (where N = G, C or A) or with an arbitrary primer. The use of
different combinations of G, C and A to anchor the first strand polydT primer enables the priming
of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at none, one or more
places along the length of the mRNA, allowing 1st strand cDNA synthesis to occur at none, one
or more points in the same gene. In both cases, 2nd strand synthesis is carried out with an arbitrary
primer. Since these arbitrary primers for the 2nd strand may also hybridize to the 1st strand cDNA
in a number of different places, several different 2nd strand products may be obtained from one
binding point of the 1st strand primer. Following 2nd strand synthesis, the original set of primers
is used to amplify the second strand products, with the result that numerous gene sequences are
amplified.



Restriction endonuclease-facilitated analysis of gene expression 

Serial Analysis of Gene Expression (SAGE) 

A more recent development in the field of differential display is SAGE analysis
(Velculescu et al. 1995). This method uses a different approach to those discussed so
far and is based on two principles. Firstly, in more than 95% of cases, short
nucleotide sequences ('tags') of only nine or 10 base pairs provide sufficient
information to identify their gene of origin. Secondly, concatenation (linking
together in a series) of these tags allows sequencing of multiple cDNAs within a
single clone. Figure 9 shows a schematic representation of the SAGE process. In this
procedure, double stranded cDNA from the test cells is synthesized with a
biotinylated polydT primer. Following digestion with a frequently cutting (4 bp
recognition sequence) restriction enzyme ('anchoring enzyme'), the 3' ends of the
cDNA population are captured with streptavidin beads. The captured population is
split into two and different adaptors are ligated to the 5' ends of each group. Incorporated
into the adaptors is a recognition sequence for a type IIS restriction enzyme, i.e. one
which cuts DNA at a defined distance (< 20 bp) from its recognition sequence.
Hence, following digestion of each captured cDNA population with the IIS enzyme,
the adaptors plus a short piece of the captured cDNA are released. The two
populations are then ligated and the products amplified. The amplified products are
cleaved with the original anchoring enzyme, religated (concatemers are formed in
the process) and cloned. The advantage of this system is that hundreds of gene tags
can be identified by sequencing only a few clones. Furthermore, the number of times
a given transcript is identified is a quantitative measurement of that gene's
abundance in the original population, a feature which facilitates identification of
differentially expressed genes in different cell populations.
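
The tag-and-count principle of SAGE lends itself to a short computational sketch. The example below is an assumption-laden illustration, not the published analysis software: it takes CATG as the anchoring-enzyme site (the site drawn in figure 9), uses a fixed 9 bp tag, ignores ditag formation and sequencing errors, and works on invented cDNA sequences. The names `sage_tag` and `tag_counts` are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # 4 bp anchoring-enzyme site, as drawn in figure 9
TAG_LEN = 9           # tag length in bp (the text quotes nine or 10)


def sage_tag(cdna, site=ANCHOR_SITE, tag_len=TAG_LEN):
    """Return the tag adjacent to the 3'-most anchoring site, or None.

    Only the 3'-most site matters because the 3' ends are the fragments
    captured on the beads after anchoring-enzyme digestion.
    """
    pos = cdna.rfind(site)
    if pos == -1:
        return None  # transcript has no anchoring site and is not sampled
    start = pos + len(site)
    tag = cdna[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def tag_counts(cdnas):
    """Count how often each tag is seen in a population of cDNAs."""
    return Counter(t for t in map(sage_tag, cdnas) if t is not None)


# Invented cDNA sequences standing in for two transcripts.
gene_a = "GGACATGACGTACGTACCAAAAAA"
gene_b = "TTCATGGGGCATGTTGACCTGAGGAAAAA"

control = tag_counts([gene_a, gene_a, gene_b])
treated = tag_counts([gene_a, gene_b, gene_b, gene_b])

for tag in sorted(set(control) | set(treated)):
    print(tag, control[tag], treated[tag])
```

Comparing the two counters tag by tag is the digital equivalent of comparing band intensities: a tag seen three times in one population and once in the other points to a candidate differentially expressed gene, subject to the sampling depth.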

Some disadvantages of SAGE analysis are that the method is technically difficult,
requires a large amount of accurate sequencing, is biased towards abundant
mRNAs, has not been validated in the pharmaco/toxicogenomic setting and, to date,
has only been used to examine well known tissue differences.

Gene Expression Fingerprinting (GEF)

A different capture/restriction digest approach for isolating differentially 
expressed genes has been described by Ivanova and Belyavsky (1995). In this 
method, RNA is converted to cDNA using biotinylated oligo(dT) primers. The 
cDNA population is then digested with a specific endonuclease and captured with 
magnetic streptavidin microbeads to facilitate removal of the unwanted 5' digestion 
products. The use of restricted 3'-ends alone serves to reduce the complexity of the
cDNA fragment pool and helps to ensure that each RNA species is represented by
not more than one restriction product. An adaptor is ligated to facilitate subsequent
amplification of the captured population. PCR is carried out with one adaptor-specific
and one biotinylated polydT primer. The reamplified population is
recaptured and the non-biotinylated strands removed by alkaline dissociation. The
non-biotinylated strand is then resynthesized using a different adaptor-specific
primer in the presence of a radiolabelled dNTP. The labelled immobilized 3' cDNA
ends are next sequentially treated with a series of different restriction endonucleases 
and the products from each digestion analysed by PAGE. The result is a fingerprint 
composed of a number of ladders (equal to the number of sequential digests used). 
By comparing test versus control fingerprints, it is possible to identify differentially 
expressed products which can then be isolated from the gel and cloned. The 
advantages of this procedure are that it is very robust and reproducible, and the 
authors estimate that 80-93% of cDNA molecules are involved in the final 
fingerprint. The disadvantage is that polyacrylamide gels can rarely resolve more 
than 300-400 bands, which compares poorly to the 1000 or more which are 
estimated to be produced in an average experiment. The use of 2-D gels such as 
those described by Uitterlinden et al. (1989) and Hatada et al. (1991) may help to 
overcome this problem. 

A similar method for displaying restriction endonuclease fragments was later 
described by Prashar and Weissman (1996). However, instead of sequential 
digestion of the immobilized 3'-terminal cDNA fragments, these authors simply
compared the profiles of the control and treated populations without further 
manipulation. 



[Figure 9 schematic: 1st strand cDNA synthesis using biotinylated poly dT primers; cDNA cleaved
with AE and captured with streptavidin beads; divide in half and ligate linkers; cleave with tagging
enzyme (TE) and produce blunt ends; ligate and amplify to give ditags; cleave with AE, isolate
ditags, concatenate, clone and sequence.]

Figure 9. Serial analysis of gene expression (SAGE) analysis. cDNA is cleaved with an anchoring enzyme
(AE) and the 3' ends captured using streptavidin beads. The cDNA pool is divided in half and each
portion ligated to a different linker, each containing a type IIS restriction site (tagging enzyme,
TE). Restriction with the type IIS enzyme releases the linker plus a short length of cDNA
(XXXXX and OOOOO indicate nucleotides of different tags). The two pools of tags are then
ligated and amplified using linker-specific primers. Following PCR, the products are cleaved with
the AE and the ditags isolated from the linkers using PAGE. The ditags are then ligated (during
which process, concatenation occurs) and cloned into a vector of choice for sequencing. After
Velculescu et al. (1995), with permission.




DNA arrays 

'Open' differential display systems are cumbersome in that it takes a great deal
of time to extract and identify candidate genes and then confirm that they are indeed
up- or down-regulated in the treated compared to the control tissue. Normally, the
latter process is carried out using Northern blotting or RT-PCR. Even so, each of
the aforementioned steps produces a bottleneck to the ultimate goal of rapid analysis
of gene expression. These problems will likely be addressed by the development of
so-called DNA arrays (e.g. Gress et al. 1992, Zhao et al. 1995, Schena et al. 1996),
the introduction of which has signalled the next era in differential gene expression
analysis. DNA arrays consist of a gridded membrane or glass 'chip' containing
hundreds or thousands of DNA spots, each consisting of multiple copies of part of
a known gene. The genes are often selected based on previously proven involvement
in oncogenesis, cell cycling, DNA repair, development and other cellular processes.
The spotted sequences are usually chosen to be as specific as possible for each gene
and animal species.
Human and mouse arrays are already commercially available and a few companies
will construct a personalized array to order, for example Clontech Laboratories and
Research Genetics Inc. The technique is rapid in that hundreds or even thousands
of genes can be spotted on a single array, and that mRNA/cDNA from the test
populations can be labelled and used directly as probe. When analysed with
appropriate hardware and software, arrays offer a rapid and quantitative means to
assess differences in gene expression between two cell populations. Of course, there
can only be identification and quantitation of those genes which are in the array
(hence the term 'closed' system). Therefore, one approach to elucidating the
molecular mechanisms involved in a particular disease/development system may be
to combine an open and a closed system: a DNA array to directly identify and
quantitate the expression of known genes in mRNA populations, and an open
system such as SSH to isolate unknown genes which are differentially expressed.
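
How array read-outs translate into a list of candidate genes can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: spot intensities are taken as already background-corrected and normalised, a 2-fold cut-off is chosen arbitrarily, and the intensity values are invented (the gene names are borrowed from tables 2 and 3 purely as labels).

```python
import math


def log2_ratios(control, treated, floor=1.0):
    """log2(treated/control) for every gene present on both arrays.

    Intensities below `floor` are clamped to avoid division by zero and
    inflated ratios from spots that are effectively at background.
    """
    return {
        gene: math.log2(max(treated[gene], floor) / max(control[gene], floor))
        for gene in control
        if gene in treated
    }


def changed_genes(control, treated, min_fold=2.0):
    """Genes whose expression changes by at least `min_fold` either way."""
    cutoff = math.log2(min_fold)
    ratios = log2_ratios(control, treated)
    return {gene: r for gene, r in ratios.items() if abs(r) >= cutoff}


# Invented spot intensities (arbitrary units) for three spotted genes.
control = {"CYP2B1": 100.0, "albumin": 800.0, "ApoE": 400.0}
treated = {"CYP2B1": 800.0, "albumin": 900.0, "ApoE": 100.0}

for gene, ratio in sorted(changed_genes(control, treated).items()):
    direction = "up" if ratio > 0 else "down"
    print(f"{gene}: {direction}-regulated, log2 ratio {ratio:+.1f}")
```

Working in log2 units treats up- and down-regulation symmetrically, so a single absolute threshold covers both directions. Only genes spotted on the array can appear in the result, which is the 'closed' limitation noted above.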

One of the main advantages of DNA arrays is the huge number of gene fragments
which can be put on a membrane; some companies have reported gridding up to
60 000 spots on a single glass 'chip' (microscope slide). These high density chip-based
micro-arrays will probably become available as mass-produced off-the-shelf
items in the near future. This should facilitate the more rapid determination of 
differential expression in time and dose-response experiments. Aside from their 
high cost and the technical complexities involved in producing and probing DNA 
arrays, the main problem which remains, especially with the newer micro-array 
(gene-chip) technologies, is that results are often not wholly reproducible between 
arrays. However, this problem is being addressed and should be resolved within the 
next few years. 



EST databases as a means to identify differentially expressed genes 

Expressed sequence tags (ESTs) are partial sequences of clones obtained from 
cDNA libraries. Even though most ESTs have no formal identity (putative 
identification is the best to be hoped for), they have proven to be a rapid and efficient 
means of discovering new genes and can be used to generate profiles of gene- 
expression in specific cells. Since they were first described by Adams et al. (1991), 
there has been a huge explosion in EST production and it is estimated that there are 
now well over a million such sequences in the public domain, representing over half
of all human genes (Hillier et al. 1996). This large number of freely available
sequences (both sequence information and clones are normally available royalty-free 
from the originators) has enabled the development of a new approach towards 
differential gene expression analysis as described by Vasmatzis et al. (1998). The 
approach is simple in theory: EST databases are first searched for genes that have a
number of related EST sequences from the target tissue of choice, but none or few
from non-target tissue libraries. Programmes to assist in the assembly of such sets of
overlapping data may be developed in-house or obtained privately or from the
internet. For example, the Institute for Genomic Research (TIGR, found at
http://www.tigr.org) provides many software tools free of charge to the scientific
community. Included amongst these is the TIGR assembler (Sutton et al. 1995), a
tool for the assembly of large sets of overlapping data such as ESTs, bacterial
artificial chromosomes (BACs), or small genomes. Candidate EST clones representing
different genes are then analysed using RNA blot methods for size and tissue
specificity and, if required, used as probes to isolate and identify the full length
cDNA clone for further characterization. In practice however, the method is rather
more involved, requiring bioinformatic and computer analysis coupled with
confirmatory molecular studies. Vasmatzis et al. (1998) have described several
problems in this fledgling approach, such as separating highly homologous 
sequences derived from different genes and an overemphasis of specificity for some 
EST sequences. However, since these problems will largely be addressed by the 
development of more suitable computer algorithms and an increased completeness 
of the EST database, it is likely that this approach to identifying differentially 
expressed genes may enjoy more patronage in the future. 
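
The database search described above amounts to counting ESTs per gene cluster and per tissue library. The sketch below shows that idea only; it is not the procedure of Vasmatzis et al. (1998). It assumes the ESTs have already been assembled into clusters, and the thresholds, cluster names and library names are invented for illustration.

```python
from collections import defaultdict


def tissue_enriched(est_hits, target, min_target=3, max_other=1):
    """Clusters with several ESTs from `target` and few from elsewhere.

    `est_hits` is an iterable of (cluster, tissue_library) pairs, one
    per EST. A cluster qualifies if it has at least `min_target` ESTs
    from the target tissue and at most `max_other` from other tissues.
    """
    in_target = defaultdict(int)
    elsewhere = defaultdict(int)
    for cluster, tissue in est_hits:
        if tissue == target:
            in_target[cluster] += 1
        else:
            elsewhere[cluster] += 1
    return sorted(
        cluster
        for cluster, n in in_target.items()
        if n >= min_target and elsewhere[cluster] <= max_other
    )


# Invented EST assignments: (cluster, library of origin).
hits = (
    [("clusterA", "prostate")] * 4 + [("clusterA", "liver")]
    + [("clusterB", "prostate")] * 3 + [("clusterB", "liver")] * 5
    + [("clusterC", "prostate")]
)
print(tissue_enriched(hits, target="prostate"))
```

Here only clusterA passes: clusterB is well represented in a non-target library, and clusterC has too few ESTs to support a call. The difficulty mentioned in the text, separating highly homologous sequences from different genes, sits upstream of this step, in how ESTs are assigned to clusters.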



Problems and potential of differential expression techniques 

The holistic or single cell approach?

When working with in vivo models of differential expression, one of the first 
issues to consider must be the presence of multiple cell types in any given specimen. 
For example, a liver sample is likely to contain not only hepatocytes, but also 
(potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g. 
lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues will 
each have their own distinctive cell populations. Also, in the case of neoplastic tissue, 
there are almost always normal, hyperplastic and/or dysplastic cells present in a 
sample. One must, therefore, be aware that genes obtained from a differential 
display experiment performed on an animal tissue model may not necessarily arise 
exclusively from the intended 'target' cells, e.g. hepatocytes/neoplastic cells. If
appropriate, further analyses using immunohistochemistry, in situ hybridization or
in situ RT-PCR should be used to confirm which cell types are expressing the
gene(s) of interest. This problem is probably most acute for those studying the
differential expression of genes in the development of different cell types, where
there is a need to examine homologous cell populations. The problem is now being
addressed at the National Cancer Institute (Bethesda, MD, USA) where new
microdissection techniques have been employed to assist in their gene analysis programme,
the Cancer Genome Anatomy Project (CGAP) (for more information see web site:
http://www.ncbi.nlm.nih.gov/ncicgap/intro.html). There are also separation
techniques available that utilise cell-specific antigens as a means to isolate target cells,
e.g. fluorescence activated cell sorting (FACS) (Dunbar et al. 1998, Kas-Deelen et
al. 1998) and magnetic bead technology (Richard et al. 1998, Rogler et al. 1998).

However, those taking a holistic approach may consider this issue unimportant. 
There is an equally appropriate view that all those genes showing altered expression 
within a compromised tissue should be taken into consideration. After all, since all
tissues are complex mixes of different, interacting cell types which intimately 
regulate each other's growth and development, it is clear that each cell type could in 
some way contribute (positively or negatively) towards the molecular mechanisms 
which lie behind responses to external stimuli or neoplastic growth. It is perhaps 
then more informative to carry out differential display experiments using in vivo as 
opposed to in vitro models, where uniform populations of identical cells probably 
represent a partial, skewed or even inaccurate picture of the molecular changes that 
occur. 

The incidence and possible implications of inter-individual biological variation 
should be considered in any approach where whole animal models are being used. It 
is clear that individuals (humans and animals) respond in different ways to identical 
stimuli. One of the best characterized examples is the debrisoquine oxidation 
polymorphism, which is mediated by cytochrome CYP2D6 and determines the 
pharmacokinetics of many commonly prescribed drugs (Lennard 1993, Meyer and 
Zanger 1997). The reasons for such differences are varied and complex, but allelic 
variations, regulatory region polymorphisms and even physical and mental health 
can all contribute to observed differences in individual responses. Careful thought 
should, therefore, be given to the specific objectives of the study and to the possible 
value of pooling starting material (tissue/mRNA). The effect of this can be 
beneficial through the ironing out of exaggerated responses and unimportant minor 
fluctuations of (mechanistically) irrelevant genes in individual animals, thus 
providing a clearer overall picture of the general molecular mechanisms of the 
response. However, at the same time such minor variations may be of utmost 
importance in deciding the ability of individual animals to succumb to or resist the 
effects of a given chemical/disease. 



How efficient are differential expression techniques at recovering a high percentage of
differentially expressed genes?

A number of groups have produced experimental data suggesting that mammalian
cells produce between 8000 and 15 000 different mRNA species at any one time
(Mechler and Rabbitts 1981, Hedrick et al. 1984, Bravo 1990), although figures as
high as 20 000-30 000 have also been quoted (Axel et al. 1976). Hedrick et al. (1984)
provided evidence suggesting that the majority of these belong to the rare abundance 
class. A breakdown of this abundance distribution is shown in table 1. 

When the results of differential display experiments have been compared with 
data obtained previously using other methods, it is apparent that not all differentially 
expressed mRNAs are represented in the final display. In particular, rare messages
(which, importantly, often encode regulatory proteins) are not easily recovered
using differential display systems. This is a major shortcoming, as the majority of
mRNA species exist at levels of less than 0.005% of the total population (table 1).
Bertioli et al. (1995) examined the efficiency of DD templates (heterogeneous
mRNA populations) for recovering rare messages and were unable to detect mRNA
species present at less than 1.2% of the total mRNA population, a level equivalent to an
intermediate or abundant species. Interestingly, when simple model systems (single
target only) were used instead of a heterogeneous mRNA population, the same
primers could detect levels of target mRNA 10 000-fold lower. These results
are probably best explained by competition for substrates from the many PCR
products produced in a DD reaction.
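
The size of this shortfall can be made explicit with a line or two of arithmetic. The sketch below uses only the three figures quoted above (0.005%, 1.2% and the 10 000-fold gain in single-target systems); the variable names are illustrative, and exact fractions are used simply to avoid floating-point rounding.

```python
from fractions import Fraction

RARE_CUTOFF_PCT = Fraction("0.005")  # most mRNA species lie below this share
DD_LIMIT_PCT = Fraction("1.2")       # lowest share detected in a mixed population
MODEL_GAIN = 10_000                  # improvement seen with single-target models

# How far the mixed-population detection limit sits above the rare class.
shortfall = DD_LIMIT_PCT / RARE_CUTOFF_PCT

# Detection limit implied for the single-target model system.
model_limit_pct = DD_LIMIT_PCT / MODEL_GAIN

print(f"Detection limit is {shortfall}-fold above the rare-message cut-off")
print(f"Single-target limit: {float(model_limit_pct)}% of the population")
print("Rare messages detectable in the model system:",
      model_limit_pct < RARE_CUTOFF_PCT)
```

On these figures the detection limit in a heterogeneous population is 240 times higher than the level at which most mRNA species occur, whereas the same primers in a single-target system would reach well below it, which is consistent with the competition explanation offered above.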

The numbers of differentially expressed mRNAs reported in the literature using
various model systems provide further evidence that many differentially expressed
mRNAs are not recovered. For example, DeRisi et al. (1997) used DNA array 
technology to examine gene expression in yeast following exhaustion of sugar in the 
medium, and found that more than 1700 genes showed a change in expression of at 
least 2-fold. In light of such a finding, it would not be unreasonable to suggest that 
of the 8000-15 000 different mRNA species produced by any given mammalian cell, 
up to 1000 or more may show altered expression following chemical stimulation. 
Whilst this may be an extreme figure, it is known that at least 100 genes are
activated/upregulated in Jurkat (T-) cells following IL-2 stimulation (Ullman et al.
1990). In addition, Wan et al. (1996) estimated that interferon-γ-stimulated HeLa
cells differentially express up to 433 genes (assuming 24000 distinct mRNAs
expressed by the cells). However, there have been few publications documenting 
anywhere near the recovery of these numbers. For example, in using DD to compare 
normal and regenerating mouse liver, Bauer et al. (1993) found only 70 of 38000 
total bands to be different. Of these, 50% (35 genes) were shown to correspond to 
differentially expressed bands. Chen et al. (1996) reported 10 genes upregulated in 
female rat liver following ethinyl estradiol treatment. McKenzie and Drake (1997) 
identified 14 different gene products whose expression was altered by phorbol 
myristate acetate (PMA, a tumour promoter agent) stimulation of a human 
myelomonocytic cell line. Kilty and Vickers (1997) identified 10 different gene 
products whose expression was upregulated in the peripheral blood leukocytes of 
allergic disease sufferers. Linskens et al. (1995) found 23 genes differentially 
expressed between young and senescent fibroblasts. Techniques other than DD 
have also provided an apparent paucity of differentially expressed genes. Using SH 
for example, Cao et al. (1997) found 15 genes differentially expressed in colorectal 
cancer compared to normal mucosal epithelium. Fitzpatrick et al. (1995) isolated 17 
genes upregulated in rat liver following treatment with the peroxisome proliferator, 
clofibrate; Philips et al. (1990) isolated 12 cDNA clones which were upregulated in
highly metastatic mammary adenocarcinoma cell lines compared to poorly metastatic
ones. Prashar and Weissman (1996) used 3' restriction fragment analysis and
identified approximately 40 genes showing altered expression within 4 h of
activation of Jurkat T-cells. Groenink and Leegwater (1996) analysed 27 gene
fragments isolated using SSH of the delayed early response phase of liver regeneration
and found only 12 to be upregulated.

In the laboratory, SSH was used to isolate up to 70 candidate genes which appear 
to show altered expression in guinea pig liver following short-term treatment with 
the peroxisome proliferator, Wy-14,643 (Rockett, Swales, Esdaile and Gibson,
unpublished observations). However, these findings have still to be confirmed by 
analysis of the extracted tissue mRNA for differential expression of these sequences. 

Whilst the latest differential display technologies are purported to include design
and experimental modifications to overcome this lack of efficiency (in both the total
number of differentially expressed genes recovered and the percentage that are true
positives), it is still not clear if such adaptations are practically effective. Proving
efficiency by spiking with a known amount of limited numbers of artificial
construct(s) is one thing, but isolating a high percentage of the rare messages already
present in an mRNA population is another. Of course, some models will genuinely
produce only a small number of differentially expressed genes. In addition, there are
also technical problems that can reduce efficiency. For example, mRNAs may have
an unusual primary structure that effectively prevents their amplification by PCR-based
systems. In addition, it is known that under certain circumstances not all
mRNAs have 3' polyA sites. For example, during Xenopus development, deadenylation
is used as a means to stabilize RNAs (Voeltz and Steitz 1998), whilst
preferential deadenylation may play a role in regulating Hsp70 (and perhaps,
therefore, other stress protein) expression in Drosophila (Dellavalle et al. 1994). The
presence of deadenylated mRNAs would clearly reduce the efficiency of systems
utilizing a polydT reverse transcription step. The efficiency of any system also
depends on the quality of the starting material. All differential display techniques 
use mRNA as their target material. However, it is difficult to isolate mRNA that is 
completely free of ribosomal RNA. Even if polydT primers are used to prime first 
strand cDNA synthesis, ribosomal RNA is often transcribed to some degree 
(Clontech PCR-Select cDNA Subtraction kit user manual). It has been shown, at 
least in the case of SSH, that a high rRNA:mRNA ratio can lead to inefficient 
subtractive hybridization (Clontech PCR-Select cDNA Subtraction kit user 
manual), and there is no reason to suppose that it will not do likewise in other SH 
approaches. Finally, those techniques that utilise a presubtraction amplification step 
(e.g. RDA) may present a skewed representation since some sequences amplify 
better than others. 

Of course, probably the most important consideration is the temporal factor. It 
is clear that any given differential display experiment can only interrogate a cell at 
one point in time. It may well be that a high percentage of the genes showing altered 
expression at that time are obtained. However, given that disease processes and 
responses to environmental stimuli involve dynamic cascades of signalling, 
regulation, production and action, it is clear that all those genes which are switched 
on/off at different times will not be recovered and, therefore, vital information may 
well be missed. It is, therefore, imperative to obtain as much information about the 
model system beforehand as possible, from which a strategy can be derived for 
targeting specific time points or events that are of particular interest to the 
investigator. One way of getting round this problem of single time point analysis is 
to conduct the experiment over a suitable time course which, of course, adds 
substantially to the amount of work involved. 



How sensitive are differential expression technologies? 

There has been little published data that addresses the issue of how large the 
change in expression must be for it to permit isolation of the gene in question with 
the various differential expression technologies. Although the isolation of genes 
whose expression is changed as little as 1.5-fold has been reported using SSH 
(Groenink and Leegwater 1996), it appears that those demonstrating a change in 
excess of 5-fold are more likely to be picked up. Thus, there is a 'grey zone'
in between where small changes could fade in and out of isolation between
experiments and animals. DD, on the other hand, is not subject to this grey
zone since, unlike SH approaches, it does not amplify the difference in expression
between two samples. Wan et al. (1996) reported that differences in expression of
twofold or more are detectable using DD. 
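
The thresholds quoted above for SSH can be written down as a small classifier. This is a sketch only: the 1.5-fold and 5-fold boundaries come from the text, whereas the function name, the category labels and the symmetric treatment of down-regulation (a 0.2-fold change is treated like a 5-fold change) are assumptions made for illustration.

```python
def ssh_detection_zone(fold_change):
    """Place an expression change in one of three zones for SSH.

    fold_change is test/control; values below 1 denote down-regulation and
    are inverted so that both directions are handled alike (an assumption).
    """
    if fold_change <= 0:
        raise ValueError("fold_change must be positive")
    magnitude = fold_change if fold_change >= 1 else 1.0 / fold_change
    if magnitude > 5:
        return "likely to be picked up"
    if magnitude >= 1.5:
        return "grey zone"
    return "below the smallest change reported"


for change in (1.2, 2.0, 6.0, 0.1):
    print(change, ssh_detection_zone(change))
```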

Resolution and visualization of differential expression products

It seems highly improbable with current technology that a gel system could be 
developed that is able to resolve all gene species showing altered expression in any 
given test system (be it SH- or DD-based). Polyacrylamide gel electrophoresis
(PAGE) can resolve size differences down to 0.2% (Sambrook et al. 1989) and is
used as standard in DD experiments. Even so, it is clear that a complex series of gene 
products such as those seen in a DD will contain unresolvable components. Thus, 
what appears to be one band in a gel may in fact turn out to be several. Indeed, it has 
been well documented (Mathieu-Daude et al. 1996, Smith et al. 1997) that a single 
band extracted from a DD often represents a composite of heterogeneous products, 
and the same has been found for SSH displays in this laboratory (Rockett et al. 
1997). One possible solution was offered by Mathieu-Daude et al. (1996), who 
extracted and reamplified candidate bands from a DD display and used single strand 
conformation polymorphism (SSCP) analysis to confirm which components 
represented the truly differentially expressed product. 

Many scientists often try to avoid the use of PAGE where possible because it is 
technically more demanding than agarose gel electrophoresis (AGE). Unfortunately, 
high resolution agarose gels such as Metaphor (FMC, Lichfield, UK) and AquaPor 
HR (National Diagnostics, Hessle, UK), whilst easier to prepare and manipulate 
than PAGE, can only separate DNA sequences which differ in size by around
1.5-2% (15-20 base pairs for a 1 kb fragment). Thus, SSH, RDA or other such
products which differ in size by less than this amount are normally not resolvable. 
However, a simple technique does in fact exist for increasing the resolving power of 
AGE — the inclusion of HA-red (10-phenyl neutral red-PEG ligand) or HA-yellow 
(bisbenzamide-PEG ligand) (Hanse Analytik GmbH, Bremen, Germany) in a 
gel separates identical or closely sized products on base content. Specifically, 
HA-red and -yellow selectively bind to GC and AT DNA motifs, respectively 
(Wawer et al. 1995, Hanse Analytik 1997, personal communication). Since both 
HA-stains possess an overall positive charge, they migrate towards the cathode 
when an electric field is applied. This is in direct opposition to DNA, which 
is negatively charged and, therefore, migrates towards the anode. Thus, if two 
DNA clones are identical in size (as perceived on a standard high resolution 
agarose gel), but differ in AT/GC content, inclusion of a HA-dye in the gel 
will effectively retard the migration of one of the sequences compared to the 
other, effectively making it apparently larger and, thus, providing a means of 
differentiating between the two. The use of HA-red has been shown to resolve 
sequences with an AT variation of less than 1% (Wawer et al. 1995), whilst Hanse
Analytik have reported that HA staining is so sensitive that in one case it was used
to distinguish two 567 bp sequences which differed by only a single point mutation
(Hanse Analytik 1996, personal communication). Therefore, if one wishes to check 
whether all the clones produced from a specific band in a differential display 
experiment are derived from the same gene species, a small amount of reamplified 
or digested clone can be run on a standard high resolution gel, and a second aliquot
in a similar gel containing one of the HA-stains. The standard gel should indicate
any gross size differences, whilst the HA-stained gel should separate otherwise
unresolvable species (on standard AGE) according to their base content. Geisinger
et al. (1997) reported successful use of this approach for identifying DD-derived
clones. Figure 10 shows such an experiment carried out in this laboratory on clones
obtained from a band extracted from an SSH display.

Figure 10. Discrimination of clones of identical/nearly identical size using HA-red. Bands of decreasing
size (1-5) were extracted from the final display of a suppression subtractive hybridization
experiment and cloned. Seven colonies were picked at random from each cloned band and their
inserts amplified using PCR. The products were run on two gels, (A) a high resolution 2% agarose
gel, and (B) a high resolution 2% agarose gel containing 1 U/ml HA-red. With few exceptions, all
the clones from each band appear to be the same size (gel A). However, the presence of HA-red
(gel B), which separates identically-sized DNA fragments based on the percentage of GC within
the sequence, clearly indicates the presence of different gene species within each band. For
example, even though all five re-amplified clones of band 1 appear to be the same size, at least four
different gene species are represented.
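
The resolution figures quoted in this section translate directly into a minimum resolvable size difference for a fragment of given length. The sketch below assumes that resolution scales linearly with fragment length, which is a simplification; the fractions are those given in the text (0.2% for PAGE, 1.5% as the better end of the range for high resolution agarose), and the function names are hypothetical.

```python
# Fractional size resolution as quoted in the text.
PAGE_RESOLUTION = 0.002       # 0.2%
AGAROSE_RESOLUTION = 0.015    # lower bound of the 1.5-2% range


def min_resolvable_difference(length_bp, fraction):
    """Smallest size difference (bp) separable at this fragment length."""
    return length_bp * fraction


def resolvable_by_size(length_a, length_b, fraction):
    """True if two fragments differ by at least the resolvable amount."""
    longer = max(length_a, length_b)
    return abs(length_a - length_b) >= min_resolvable_difference(longer, fraction)


# Two products of 1000 bp and 1010 bp.
print(resolvable_by_size(1000, 1010, AGAROSE_RESOLUTION))
print(resolvable_by_size(1000, 1010, PAGE_RESOLUTION))
```

For products that fail this size test on agarose, separation on base content with an HA-stain, as described above, is the alternative.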

An alternative approach is to carry out a 2-D analysis of the differential display 
products. In this approach, size-based separation is first carried out in a standard 
agarose gel. The gel slice containing the display is then extracted and incorporated 
into an HA gel for resolution based on AT/GC content.
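
The logic of such a 2-D separation can be mimicked computationally when clone sequences are available. The following sketch groups sequences first by length and then by GC content; the function names and the 1% GC bin width are assumptions chosen for illustration and do not model the actual migration behaviour of DNA in an HA gel.

```python
from collections import defaultdict


def gc_fraction(sequence):
    """Fraction of G and C bases in a DNA sequence (0.0 to 1.0)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in sequence) / len(sequence)


def two_d_groups(clones, gc_step=0.01):
    """Group clones by (length, GC bin): first dimension size, second base content.

    clones maps a clone name to its sequence; gc_step is a hypothetical bin width.
    """
    groups = defaultdict(list)
    for name, sequence in clones.items():
        gc_bin = round(gc_fraction(sequence) / gc_step)
        groups[(len(sequence), gc_bin)].append(name)
    return dict(groups)


clones = {"a": "ATATATAT", "b": "GCGCGCGC", "c": "ATATATAT"}
for key, names in two_d_groups(clones).items():
    print(key, names)
```

Clones "a" and "c" fall into the same group, whereas "b", although identical in length, is separated on base content.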

Of course, one should always consider the possibility of there being different 
gene species which are the same size and have the same GC/AT content. However,
even these species are not unresolvable given some effort; again, one might use
SSCP, or perhaps a denaturing gradient gel electrophoresis (DGGE) or temperature
gradient gel electrophoresis (TGGE) approach to resolve the contents of a band,
either directly on the extracted band (Suzuki et al. 1991) or on the reamplified
product. 

The requirement of some differential display techniques to visualize large 
numbers of products (e.g. DD and GEF) can also present a problem in that, in terms 
of numbers, the resolution of PAGE rarely exceeds 300-400 bands. One approach to 
overcoming this might be to use 2-D gels such as those described by Uitterlinden et
al. (1989) and Hatada et al. (1991).






Extraction of differentially expressed bands from a gel can be complex since, in 
some cases (e.g. DD, GEF), the results are visualized by autoradiographic means, 
such that precise overlay of the developed film on the gel must occur if the correct 
band is to be extracted for further analysis. Clearly, a misjudged extraction can 
account for many man-hours lost. This problem, and that of the use of radioisotopes,
has been addressed by several groups. For example, Lohmann et al. (1995) 
demonstrated that silver staining can be used directly to visualize DD bands in 
horizontal PAGs. An et al. (1996) avoided the use of radioisotopes by transferring a 
small amount (20-30%) of the DNA from their DD to a nylon membrane, and 
visualizing the bands using chemiluminescent staining before going back to extract 
the remaining DNA from the gel. Chen and Peck (1996) went one step further and 
transferred the entire DD to a nylon membrane. The DNA bands were then 
visualized using a digoxigenin (DIG) system (DIG was attached to the polydT 
primers used in the differential display procedure). Differentially expressed bands 
were cut from the membrane and the DNA eluted by washing with PCR buffer prior
to reamplification. 

One of the advantages of using techniques such as SSH and RDA is that the final
display can be run on an agarose gel and the bands visualized with simple ethidium 
bromide staining. Whilst this approach can provide acceptable results, overstaining 
with SYBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances 
the intensity and sharpness of the bands. This greatly aids in their precise extraction 
and often reveals some faint products that may otherwise be overlooked. Whilst 
differential displays stained with SYBR Green I are better visualized using short 
wavelength UV (254 nm) rather than medium wavelength (306 nm), the shorter 
wavelength is much more DNA damaging. In practice, it takes only a few seconds 
to damage DNA extracted under 254 nm irradiation, effectively preventing 
reamplification and cloning. The best approach is to overstain with SYBR Green I 
and extract bands under a medium wavelength UV transillumination. 

The possible use of 'microfingerprinting' to reduce complexity

Given the sheer number of gene products and the possible complexity of each 
band, an alternative approach to rapid characterization may be to use an enhanced 
analysis of a small section of a differential display: a 'sub-fingerprint' or
'micro-fingerprint'. In this case, one could concentrate on those bands which only appear
in a particular chosen size region. Reducing the fingerprint in this way has at least 
two advantages. One is that it should be possible to use different gel types, 
concentrations and run times tailored exactly to that region. Currently, one might 
run products from 100-3000+ bp on the same gel, which leads to compromise in the
gel system being used and consequently to suboptimal resolution, both in terms of 
size and numbers, and can lead to problems in the accurate excision of individual 
bands. Secondly, it may be possible to enhance resolution by using a 2-D analysis 
using a HA-stain, as described earlier. In summary, if a range of gene product sizes 
is carefully chosen to include certain 'relevant' genes, the 2-D system standardized,
and appropriate gene analysis used, it may be possible to develop a method for the 
early and rapid identification of compounds which have similar or widely different 
cellular effects. If the prognosis for exposure to one or more other chemicals which 
display a similar profile is already known, then one could perhaps predict similar 
effects for any new compounds which show a similar micro-fingerprint. 
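
One way of making the comparison of micro-fingerprints concrete is to restrict each display to the chosen size window and score the overlap between compounds. The sketch below is illustrative only: it assumes that bands have already been sized and can be matched exactly by their size in base pairs (in practice a tolerance would be needed), and the use of the Jaccard index as the similarity measure is an assumption, not a method proposed in the text.

```python
def micro_fingerprint(band_sizes_bp, low_bp, high_bp):
    """Keep only the bands that fall inside the chosen size region."""
    return {size for size in band_sizes_bp if low_bp <= size <= high_bp}


def jaccard_similarity(fingerprint_a, fingerprint_b):
    """Shared bands divided by all bands seen in either fingerprint."""
    if not fingerprint_a and not fingerprint_b:
        return 1.0
    shared = fingerprint_a & fingerprint_b
    combined = fingerprint_a | fingerprint_b
    return len(shared) / len(combined)


# Hypothetical band sizes for a reference compound and a new compound.
reference = micro_fingerprint([150, 420, 455, 480, 900], 400, 500)
candidate = micro_fingerprint([210, 420, 480, 495, 1200], 400, 500)
print(f"Similarity within 400-500 bp: {jaccard_similarity(reference, candidate):.2f}")
```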




An alternative approach to microfingerprinting is to examine altered expression 
in specific families of genes through careful selection of PCR primers and/or
post-reaction analysis. Stress genes, growth factors and/or their receptors, cell cycling
genes, cytochromes P450 and regulatory proteins might be considered as candidates 
for analysis in this way. Indeed, some off-the-shelf DNA arrays (e.g. Clontech's 
Atlas cDNA Expression Array series) already anticipated this to some degree by 
grouping together genes involved in different responses e.g. apoptosis, stress, DNA- 
damage response etc. 



Screening 

False positives 

The generation of false positives has been discussed at length amongst the 
differential display community (Liang et al. 1993, 1995, Nishio et al. 1994, Sun et al.
1994, Sompayrac et al. 1995). The reason for false positives varies with the 
technique being used. For instance, in RDA, the use of adaptors which have not 
been HPLC purified can lead to the production of false positives through illegitimate 
ligation events (O'Neill and Sinclair 1997), whilst in DD they can arise through 
PCR artifacts and illegitimate transcription of rRNA. In SH, false positives appear
to be derived largely from abundant gene species, although some may arise from 
cDNA/mRNA species which do not undergo hybridization for technical reasons. 

A quick screening of putative differentially expressed clones can be carried out 
using a simple dot blot approach, in which labelled first strand probes synthesized 
from tester and driver mRNA are hybridized to an array of said clones (Hedrick et 
al. 1984, Sakaguchi et al. 1986). Differentially expressed clones will hybridize to 
tester probe, but not driver. The disadvantage of this approach is that rare species 
may not generate detectable hybridization signals. One option for those using SSH 
is to screen the clones using a labelled probe generated from the subtracted cDNA 
from which it was derived, and with a probe made from the reverse subtraction 
reaction (ClonTechniques 1997a). Since the SSH method enriches rare sequences, 
it should be possible to confirm the presence of clones representing low abundance 
genes. Despite this quick screening step, there is still the need to go back to the 
original mRNA and confirm the altered expression using a more quantitative 
approach. Although this may be achieved using Northern blots, the sensitivity is 
poor by today's high standards and one must rely on PCR methods for accurate and 
sensitive determinations (see below). 
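
The dot blot screen described above amounts to comparing two hybridization signals for each clone. A minimal sketch of that comparison follows, assuming that spot intensities have already been quantified; the function name, the background subtraction and the twofold cut-off are hypothetical choices and would need to be set empirically.

```python
def classify_clone(tester_signal, driver_signal, background, min_ratio=2.0):
    """Classify one arrayed clone from its tester and driver probe signals.

    All signals are in the same arbitrary units; min_ratio is a hypothetical
    cut-off, not a value taken from the cited methods.
    """
    tester = tester_signal - background
    driver = driver_signal - background
    if tester <= 0 and driver <= 0:
        # Rare species may give no detectable signal with either probe.
        return "undetectable"
    if driver <= 0:
        return "tester-specific"
    if tester / driver >= min_ratio:
        return "enriched in tester"
    return "not differential"


print(classify_clone(110, 30, background=10))
print(classify_clone(5, 5, background=10))
```

The "undetectable" outcome corresponds to the limitation noted above for rare species, which is where a probe made from the subtracted cDNA becomes useful.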



Sequence analysis 

The majority of differential display procedures produce final products which are 
between 100 and 1000 bp in size. However, this may considerably reduce the size of
the sequence for analysis of the DNA databases. This in turn leads to a reduced 
confidence in the result — several families of genes have members whose DNA 
sequences are almost identical except in a few key stretches, e.g. the cytochrome 
P450 gene superfamily (Nelson et al. 1996). Thus, does the clone identified as being
almost identical to gene X really come from that gene, or from its brother gene, or
from its as yet undiscovered sister? For example, using SSH, part of a gene was isolated,
which was up-regulated in the liver of rats exposed to Wy-14,643 and was identified
by a FASTA search as being transferrin (data not shown). However, transferrin is
known to be downregulated by hypolipidemic peroxisome proliferators such as
Wy-14,643 (Hertz et al. 1996), and this was confirmed with subsequent RT-PCR
analysis. This suggests that the gene sequence isolated may belong to a gene which 
is closely related to transferrin, but is regulated by a different mechanism. 

A further problem associated with SH technology is redundancy. In most cases 
before SH is carried out, the cDNA population must first be simplified by restriction
digestion. This is important for at least two reasons: 

(1) To reduce complexity — long cDNA fragments may form complex networks 
which prevent the formation of appropriate hybrids, especially at the high 
concentrations required for efficient hybridization. 

(2) Cutting the cDNAs into small fragments provides better representation of 
individual genes. This is because genes derived from related but distinct 
members of gene families often have similar coding sequences that may cross- 
hybridize and be eliminated during the subtraction procedure (Ko 1990). 
Furthermore, different fragments from the same cDNA may differ considerably 
in terms of hybridization and amplification and, thus, may not efficiently do one 
or the other (Wang and Brown 1991). Thus, some fragments from differentially 
expressed cDNAs may be eliminated during subtractive hybridization pro- 
cedures. However, other fragments may be enriched and isolated. As a 
consequence of this, some genes will be cut one or more times, giving rise to two 
or more fragments of different sizes. If those same genes are differentially 
expressed, then two or more of the different size fragments may come through 
as separate bands on the final differential display, increasing the observed 
redundancy and increasing the number of redundant sequencing reactions. 
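
The way in which restriction digestion gives rise to redundancy can be seen by cutting a sequence in silico. The sketch below assumes a four-base cutter with the recognition site GTAC cut in the middle to leave blunt ends (the specificity of RsaI, chosen here purely as an example); the function name and the toy sequence are hypothetical.

```python
def digest(sequence, site="GTAC", cut_offset=2):
    """Cut a sequence at every occurrence of a restriction site.

    cut_offset is the position within the site at which the cut falls
    (2 for a blunt cut in the middle of a four-base site).
    """
    sequence = sequence.upper()
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return [fragment for fragment in fragments if fragment]


# A toy cDNA containing two sites yields three fragments of different sizes.
fragments = digest("AAAGTACTTTGTACCC")
print(fragments, [len(fragment) for fragment in fragments])
```

If the parent gene is differentially expressed, each of these fragments may come through as a separate band and be sequenced independently.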

Sequence comparisons also throw up another important point: at what degree
of sequence similarity does one accept a result? Is 90% identity between a gene
derived from your model species and another acceptably close? Is 95% between
your sequence and one from the same species also acceptable? This problem is
particularly relevant when the forward and reverse sequence comparisons give 
similar sequences with completely different gene species! An arbitrary decision 
seems to be to allocate genes that are definite (95% and above similarity) and then 
group those between 60 and 95% as being related or possible homologues. 
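
The arbitrary allocation described above can be stated as a simple rule. In the sketch below the 95% and 60% boundaries are those given in the text; the function name, the category labels and the treatment of matches below 60% as unassigned are assumptions.

```python
def classify_match(percent_identity):
    """Allocate a database match to a category from its percent identity."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent_identity must lie between 0 and 100")
    if percent_identity >= 95:
        return "definite"
    if percent_identity >= 60:
        return "related or possible homologue"
    return "unassigned"


for identity in (98.5, 90.0, 45.0):
    print(identity, classify_match(identity))
```

A rule of this kind does not resolve the case where forward and reverse reads match different genes; those clones still need to be inspected individually.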

Quantitative analysis 

At some point, one must give consideration to the quantitative analysis of the 
candidate genes, either as a means of confirming that they are truly differentially 
expressed, or in order to establish just what the differences are. Northern blot 
analysis is a popular approach as it is relatively easy and quick to perform. However, 
the major drawback with Northern blots is that they are often not sensitive enough 
to detect rare sequences. Since the majority of messages expressed in a cell are of low 
abundance (see table 1), this is a major problem. Consequently, RT-PCR may be the 
method of choice for confirming differential expression. Although the procedure is 
somewhat more complex than Northern analysis, requiring synthesis of primers and 
optimization of reaction conditions for each gene species, it is now possible to set up 
high throughput PCR systems using multichannel pipettes, 96+-well plates and
appropriate thermal cycling technology. Whilst quantitative analysis is more 
desirable, being more accurate and without reliance on an internal standard, the 
money and time needed to develop a competitor molecule is often excessive, 
especially when one might be examining tens or even hundreds of gene species. The 
use of semi-quantitative analysis is simpler, although still relatively involved. One 
must first of all choose an internal standard that does not change in the test cells 
compared to the controls. Numerous reference genes have been tried in the past, for 
example interferon-gamma (IFN-γ, Frye et al. 1989), β-actin (Heuval et al. 1994),
glyceraldehyde-3-phosphate dehydrogenase (GAPDH, Wong et al. 1994),
dihydrofolate reductase (DHFR, Mohler and Butler 1991), β2-microglobulin
(β-2-m, Murphy et al. 1990), hypoxanthine phosphoribosyl transferase (HPRT, Foss et
al. 1998) and a number of others (ClonTechniques 1997b). Ideally, an internal
standard should not change its level of expression in the cell regardless of cell age, 
stage in the cell cycle or through the effects of external stimuli. However, it has been 
shown on numerous occasions that the levels of most housekeeping genes currently 
used by the research community do in fact change under certain conditions and in 
different tissues (ClonTechniques 1997b). It is imperative, therefore, that pre- 
liminary experiments be carried out on a panel of housekeeping genes to establish 
their suitability for use in the model system. 
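
A preliminary experiment of this kind can be summarized by ranking the candidate housekeeping genes according to how much their signal varies across the conditions of the model system. The following sketch uses the coefficient of variation for that purpose, which is an assumption made for illustration rather than a method prescribed in the text; the function names and the signal values are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(signals):
    """Standard deviation divided by the mean; lower means more stable."""
    average = mean(signals)
    if average == 0:
        raise ValueError("mean signal is zero")
    return pstdev(signals) / average


def most_stable_reference(panel):
    """Return the gene whose signal varies least across the conditions tested.

    panel maps a gene name to its signals in control and treated samples.
    """
    return min(panel, key=lambda gene: coefficient_of_variation(panel[gene]))


def normalized_expression(target_signal, reference_signal):
    """Express a target gene relative to the chosen internal standard."""
    if reference_signal <= 0:
        raise ValueError("reference_signal must be positive")
    return target_signal / reference_signal


# Hypothetical band intensities for two candidate standards.
panel = {"GAPDH": [100, 100, 100], "beta-actin": [100, 150, 50]}
print(most_stable_reference(panel))
```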

Interpretation of quantitative data must also be treated with caution. By 
comparing the lists of genes identified by differential expression one can perhaps 
gain insight into why two different species react in different ways to external stimuli. 
For example, rats and mice appear sensitive to the non-genotoxic effects of a wide 
range of peroxisome proliferators whilst Syrian hamsters and guinea pigs are largely 
resistant (Orton et al. 1984, Rodricks and Turnbull 1987, Lake et al. 1989, 1993,
Makowska et al. 1992). A simplified approach to resolving the reason(s) why is to
compare lists of up- and down-regulated genes in order to identify those which are 
expressed in only one species and, through background knowledge of the effects of 
the said gene, might suggest a mechanism of facilitated non-genotoxic carcinogenesis 
or protection. Of course, the situation is likely to be far more complex. Perhaps if 
there were one key gene protecting guinea pig from non-genotoxic effects and it was 
upregulated 50 times by PPs, the same gene might only be up-regulated five times 
in the rat. However, since both were noted to be upregulated, the importance of the 
gene may be overlooked. Just to complicate matters, a large change in expression 
does not necessarily mean a biologically important change. For example, what is the 
true relevance of gene Y which shows a 50-fold increase after a particular treatment, 
and gene Z which shows only a 5-fold increase? If one examines the literature one 
may find that historically, gene Y has often been shown to be up-regulated 40-60- 
fold by a number of unrelated stimuli — in light of this the 50-fold increase would 
appear less significant. However, the literature may show that gene Z has never been 
recorded as having more than doubled in expression — which makes your 5-fold 
increase all the more exciting. Perhaps even more interesting is if that same 5-fold 
increase has only been seen in related neoplasms or following treatment with related 
chemicals. 

Problems in using the differential display approach 

Differential display technology originally held promise of an easily obtainable 
'fingerprint' of those genes which are up- or down-regulated in test animals/cells in
a developmental process or following exposure to given stimuli. However, it has
become clear that the fingerprinting process, whilst still valid, is much too complex
to be represented by a single technique profile. This is because all differential display
techniques have common and/or unique technical problems which preclude the 
isolation and identification of all those genes which show changes in expression. 
Furthermore, there are important genetic changes related to disease development 
which differential expression analysis is simply not designed to address. An example
of this is the presence of small deletions, insertions, or point mutations such as those 
seen in activated oncogenes, tumour suppressor genes and individual poly- 
morphisms. Polymorphic variations, small though they usually are, are often 
regarded as being of paramount importance in explaining why some patients 
respond better than others to certain drug treatments (and, in logical extension, why 
some people are less affected by potentially dangerous xenobiotics/carcinogens than
others). The identification of such point mutations and naturally occurring 
polymorphisms requires the subsequent application of sequencing, SSCP, DGGE 
or TGGE to the gene of interest. Furthermore, differential display is not designed 
to address issues such as alternatively spliced gene species or whether an increased 
abundance of mRNA is a result of increased transcription or increased mRNA 
stability. 



Conclusions 

Perhaps the main advantage of open system differential display techniques is that 
they are not limited by extant theories or researcher bias in revealing genes which are 
differentially expressed, since they are designed to amplify all genes which 
demonstrate altered expression. This means that they are useful for the isolation of 
previously unknown genes which may turn out to be useful biomarkers of a particular
state or condition. At least one open system (SAGE) is also quantitative, thus
eliminating the need to return to the original mRNA and carry out Northern/PCR
analysis to confirm the result. However, the rapid progress of genome mapping 
projects means that over the next 5-10 years or so, the balance of experimental use 
will switch from open to closed differential display systems, particularly DNA 
arrays. Arrays are easier and faster to prepare and use, provide quantitative data, are 
suitable for high throughput analysis and can be tailored to look at specific signalling 
pathways or families of genes. Identification of all the gene sequences in human and 
common laboratory animals, combined with improved DNA array technology,
means that it will soon no longer be necessary to try to isolate differentially expressed 
genes using the technically more demanding open system approach. Thus, their 
main advantage (that of identifying unknown genes) will be largely eradicated. It is 
likely, therefore, that their sphere of application will be reduced to analysis of the 
less common laboratory species, since it will be some time yet before the genomes of 
such animals as zebrafish, electric eels, gerbils, crayfish and squid, for example, will 
be sequenced. 

Of course, in the end the question will always remain: What is the functional/ 
biological significance of the identified, differentially expressed genes? One 
persistent problem is understanding whether differentially expressed genes are a 
cause or consequence of the altered state. Furthermore, many chemicals, such as 
non-genotoxic carcinogens, are also mitogens and so genes associated with 
replication will also be upregulated but may have little or nothing to do with the
carcinogenic effect. Whilst differential display technology cannot hope to answer 
these questions, it does provide a springboard from which identification, regulatory 
and functional studies can be launched. Understanding the molecular mechanism of 
cellular responses is almost impossible without knowing the regulation and function 
of those genes and their condition (e.g. mutated). In an abstract sense, differential 
display can be likened to a still photograph, showing details of a fixed moment in 
time. Consider the Historian who knows the outcome of a battle and the placement 
and condition of the troops before the battle commenced, but is asked to try and 
deduce how the battle progressed and why it ended as it did from a few still 
photographs — an impossible task. In order to understand the battle, the Historian 
must find out the capabilities and motivation of the soldiers and their commanding 
officers, what the orders were and whether they were obeyed. He must examine the 
terrain, the remains of the battle and consider the effects the prevailing weather 
conditions exerted. Likewise, if mechanistic answers are to be forthcoming, the 
scientist must use differential display in combination with other techniques, such as 
knockout technology, the analysis of cell signalling pathways, mutation analysis and 
time and dose response analyses. Although this review has emphasized the 
importance of differential gene profiling, it should not be considered in isolation and 
the full impact of this approach will be strengthened if used in combination with 
functional genomics and proteomics (2-dimensional protein gels from isoelectric 
focusing and subsequent SDS electrophoresis and virtual 2D-maps using capillary 
electrophoresis). Proteomics is attracting much recent attention as many of the 
changes resulting in differential gene expression do not involve changes in mRNA 
levels, as described extensively herein, but rather protein-protein, protein-DNA and
protein phosphorylation events which would require functional genomics or 
proteomic technologies for investigation. 

Despite the limitations of differential display technology, it is clear that many 
potential applications and benefits can be obtained from characterizing the genetic 
changes that occur in a cell during normal and disease development and in response 
to chemical or biological insult. In light of functional data, such profiling will 
provide a 'fingerprint' of each stage of development or response, and in the long
term should help in the elucidation of specific and sensitive biomarkers for different 
types of chemical/biological exposure and disease states. The potential medical and 
therapeutic benefits of understanding such molecular changes are almost im- 
measurable. Amongst other things, such fingerprints could indicate the family or 
even specific type of chemical an individual has been exposed to plus the length 
and/or acuteness of that exposure, thus indicating the most prudent treatment. 
They may also help uncover differences in histologically identical cancers, provide 
diagnostic tests for the earliest stages of neoplasia and, again, perhaps indicate the 
most efficacious treatment.

The Human Genome Project will be completed early in the next century and the 
DNA sequence of all the human genes will be known. The continuing development 
and evolution of differential gene expression technology will ensure that this 
knowledge contributes fully to the understanding of human disease processes. 

Acknowledgements

We acknowledge Drs Nick Plant (University of Surrey), Sally Darney and Chris 
Luft (US EPA at RTP) for their critical analysis of the manuscript prior to 
submission. This manuscript has been reviewed in accordance with the policy of the
US Environmental Protection Agency and approved for publication. Approval does 
not signify that the contents reflect the views and policies of the Agency, nor does 
mention of trade names constitute endorsement or recommendation for use. 

References 

Adams, M. D., Kelley, J. M., Gocayne, J. D., Dubnick, M., Polymeropoulos, M. H., Xiao, H.,
Merril, C. R., Wu, A., Olde, B., Moreno, R. F., Kerlavage, A. R., McCombie, W. R. and
Venter, J. C., 1991, Complementary DNA sequencing: expressed sequence tags and human
genome project. Science, 252, 1651-1656.

An, G., Luo, G., Veltri, R. W. and O'Hara, S. M., 1996, Sensitive non-radioactive differential display
method using chemiluminescent detection. Biotechniques, 20, 342-346.

Axel, R., Feigelson, P. and Schultz, G., 1976, Analysis of the complexity and diversity of mRNA from
chicken liver and oviduct. Cell, 7, 247-254.

Band, V. and Sager, R., 1989, Distinctive traits of normal and tumor-derived human mammary
epithelial cells expressed in a medium that supports long-term growth of both cell types.
Proceedings of the National Academy of Sciences (USA), 86, 1249-1253.

Bauer, D., Muller, H., Reich, J., Riedel, H., Ahrenkeel, V., Warthoe, P. and Strauss, M., 1993,
Identification of differentially expressed mRNA species by an improved display technique
(DDRT-PCR). Nucleic Acids Research, 21, 4272-4280.

Bertioli, D. J., Schlichter, U. H. A., Adams, M. J., Burrows, P. R., Steinbiss, H.-H. and Antoniw,
J. F., 1995, An analysis of differential display shows a strong bias towards high copy number
mRNAs. Nucleic Acids Research, 23, 4520-4523.

Bravo, R., 1990, Genes induced during the G0/G1 transition in mouse fibroblasts. Seminars in Cancer
Biology, 1, 37-46.

Burn, T. C., Petrovick, M. S., Hohaus, S., Rollins, B. J. and Tenen, D. G., 1994, Monocyte
chemoattractant protein-1 gene is expressed in activated neutrophils and retinoic acid-induced
human myeloid cell lines. Blood, 84, 2776-2783.

Cao, J., Cai, X., Zheng, L., Geno, L., Shi, Z., Pao, C. C. and Zheng, S., 1997, Characterisation of
colorectal cancer-related cDNA clones obtained by subtractive hybridisation screening. Journal of
Cancer Research and Clinical Oncology, 123, 447-451.

Cassidy, S. B., 1995, Uniparental disomy and genomic imprinting as causes of human genetic disease.
Environmental and Molecular Mutagenesis, 25 (Suppl 26), 13-20.

Chang, G. W. and Terzaghi-Howe, M., 1998, Multiple changes in gene expression are associated with
normal cell-induced modulation of the neoplastic phenotype. Cancer Research, 58, 4445-4452.

Chen, J., Schwartz, D. A., Young, T. A., Norris, J. S. and Yager, J. D., 1996, Identification of genes
whose expression is altered during mitosuppression in livers of ethinyl estradiol-treated female
rats. Carcinogenesis, 17, 2783-2786.

Chen, J. J. W. and Peck, K., 1996, Non-radioactive differential display method to directly visualise and
amplify differential bands on nylon membrane. Nucleic Acids Research, 24, 793-794.

ClonTechniques, 1997a, PCR-Select Differential Screening Kit: the next step after Clontech
PCR-Select cDNA subtraction. ClonTechniques, XII, 18-19.

ClonTechniques, 1997b, Housekeeping RT-PCR amplimers and cDNA probes. ClonTechniques, XII,
15-16.

Davis , M. M., Cohen , D. I., Nielsen , E. A., Steinmetz , M., Paul, W. E. and Hood, L., 1984, Cell- 
type-specific cDNA probes and the murine I region : the localization and orientation of Ad alpha. 
Proceedings of the National Academy of Sciences {USA), 81, 2194-2198. 

Dellavalle , R. P., Peterson , R. and Lindquist , S., 1994, Preferential deadenylation of HSP70 mRNA 
plays a key role in regulating Hsp70 expression in Drosophila melanogaster. Molecidar and Cell 
Biology, 14, 3646-3659, 

DeRisi , J. L., Vashwanath , R. L. and Brown, P., 1997, Exploring the metabolic and genetic control of 
gene expression on a genomic scale. Science, 278, 680-686. 

DiATCHENKo , L. , Lau, Y.-F . C, , Campbell ,A.P.,Chenchik , A.,Moqadam , F.,Huang, B., Lukyanov , 
K., Gurskaya, N., Sverdlov , E. D. and Siebert , P. D., 1996, Suppression subtractive 
hybridisation: A method for generating differentially regulated or tissue-specific cDNA probes 
and libraries. Proceedings of the National Academy of Sciences {USA), 93, 6025-6030. 

DoGRA, S. C, Whitelaw , M. L. and May, B. K., 1998, Transcriptional activation of cytochrome P450 
genes by different classes of chemical inducers. Clinical and Experimental Pharmacology and 
Physiology, 25, 1-9. 

DUGUID, J. R. and Dinauer , M. C, 1990, Library subtraction of in vitro cDNA libraries to identify 
differentially expressed genes in scrapie infection. Nucleic Acids Research, 18, 2789-2792. 

Dunbar, P. R., Ogg, G. S., Chen , J., Rust, N., van der Bruggen, P. and Cerundolo , V., 1998, Direct 
isolation, phenotyping and cloning of low-frequency antigen-specific cytotoxic T lymphocytes 
from peripheral blood. Current Biology, 26, 413-416. 



688 



J. C, Rockett et al. 



FiTZPATRiCK ,D. R., Germain -Lee, E. and Valle, D., 1995, Isolation and characterisation of rat and 
human cDNAs encoding a novel putative peroxisomal enoyl-CoA hydratase. Genomics, 27, 
457-466. 

Foss, D. L., Baarsch, M.J. and Murtaugh, M.P., 1998, Regulation of hypoxanthine phospho- 

ribosyltransferase, glyceraldehyde-3-phosphate dehydrogenase and beta-actin mRNA expression 

in porcine immune cells and tissues. Animal Biotechnology^ 9, 67-78 . 
Frye, R. a., Benz, C. C. and Liu, E., 1989, Detection of amplified oncogenes by differential polymerase 

chain reaction. Oncogene, 4, 1153-1157. 
Geisinger , A., Rodriguez, R., Romero , V. and Wettstein R., 1997, A simple method for screening 

cDNAs arising from the cloning of RNA differential display bands. Elsevier Trends Journals 

Technical Tips Online, http :/ Ato. trends, com, document TOlllO. 
Gress, T. M., Hoheisel , J. D,, Lennon , G. G., Zehetner , G. and Lehrach , H., 1992, Hybridisation 

fingerprinting of high density cDNA filter arrays with cDNA pools derived from whole tissues. 

Mammalian Genome, 3, 609-619. 
Griffin , G, and Krishna , S., 1998, Cytokines in infectious diseases. Journal of the Royal College of 

Physicians, London, 32, 195-198. 
Groenink , M. and Leegwater , A. C. J., 1996, Isolation of delayed early genes associated with liver 

regeneration using Clontech PCR-select subtraction technique. Clontechniques , XI, 23-24. 
Guimaraes , M. J., Bazan, J. F., Zlotnik , A., Wiles, M.V., Grimaldi , J.C., Lee, F. and 

McClanahan , T., 1995b, A new approach to the study of haematopoietic development in the yolk 

sac and embryoid bodies. Development, 121, 3335-3346. 
GuiMERAEs , M. J., Lee, F., Zlotnik, A. and McClanahan, T., 1995a, Differential display by 

PCR:novel findings and applications. Nucleic Acids Research, 23, 1832-1833, 
GuRSKAYA, N. G., DiATCHENKO , L., Chenchik , P. D., Siebert , P. D., Khaspekov , G. L., Lukyanov , 

K.A., Vagner, L.L., Ermolaeva , O. D., Lukyanov, S. A. and Sverdlov , E. D., 1996, 

Equalising cDNA subtraction based on selective suppression of polymerase chain reaction: 

Cloning of Jurkat cell transcripts induced by phytohemaglutinin and phorbol 12-Myrystate 13- 

Acetate. Analytical Biochemistry, 240, 90-97. 
Hampson , I. N. and Hampson , L., 1997, CCLS and DROP — subtractive cloning made easy. Life Science 

News (A publication of Amersham Life Science), 23, 22-24. 
Hampson , I. N., Hampson , L. and Dexter, T. M., 1996, Directional random oligonucleotide primed 

(DROP) global amplification of cDNA: its application to subtractive cDNA cloning. Nucleic 

Acids Research. 24, 4832^835. 
Hampson , I. N., Pope, L., Cowling , G. J. and Dexter , T. M., 1992, Chemical cross linking subtraction 

(CCLS): a new method for the generation of subtractive hybridisation probes. Nucleic Acids 

Research, 20, 2899. 

Hara, E., Kato, T., Nakada, S., Sekiya , S. and Oda, K., 1991, Subtractive cDNA cloning using 

oligo(dT)30-latex and PCR: isolation of cDNA clones specific to undifferentiated human 

embryonal carcinoma cells. Nucleic Acids Research, 19, 7097-7104. 
Hatada, I., Hayashizake, Y., Hirotsune , S., Komatsubara , H. and Mukai, T., 1991, A genomic 

scanning method for higher organisms using restriction sites as landmarks. Proce&iings of the 

National Academy of Sciences {USA), 88, 9523-9527. 
Hecht, N., 1998, Molecular mechanisms of male sperm cell differentiation. Bioessays, 20, 555-561. 
Hedrick, S., Cohen, D. I., Nielsen , E. A. and Davis, M, E., 1984, Isolation of T cell-specific 

membrane-associated proteins. Nature, 308, 149-153. 
Hertz, R., Seckbach , M., Zakin , M. M. and Bar-Tana, J., 1996, Transcriptional suppression of the 

transferrin gene by hypolipidemic peroxisome proliferators. J'oi^ry/a/ of Biological Chemistry, 271, 

218-224, 

Heuval , J. P. v., Clark, G, C, Kohn , M. C, Tritscher , A. M., Greenlee , W. F., Lucer , G. W. and 
Bell, D. A., 1994, Dioxin-responsive genes: Examination of dose-response relationships using 
quantitative reverse transciptase- polymerase chain reaction. Cancer Research, 54, 62-68. 

HiLLiER , L. D., Lennon , G., Becker, M., Bonaldo , M. F., Chl\pelli , B,,Chissoe , S., Dietrich , N., 
DuBuQUE, T., Favello , A., Gish , W., Hawkins , M., Hultman , M., Kucaba, T., Lacy, M., Le, 
M., Le,N., Mardis, E., Moore, B., Morris, M., Parsons, J., France, C,,Rifkin , L,, Rohlrng , 
T., Schellenberg , K., Soares, M. B., Tan, F., Thierry -Meg, J., Trevaskis , E., Underwood , 
K., Wohldman , P., Waterston , R., Wilson , R and Marra, M., 1996, Generation and analysis 
of 280,000 human expressed sequence tags. Genome Research, 6, 807-828. 

Hubank, M. and Schatz, D. G., 1994, Identifying differences in mRNA expression by representational 
difference analysis. Nucleic Acids Research, 22, 5640-5648. 

Hunter, T., 1991, Cooperation between oncogenes. Cell, 64, 249-270. 

Ivanova ,N. B.and Belyavsky ,A. V., 1995, Identification of differentially expressed genes by restriction 

endonuclease -based gene expression fingerprinting. Nucleic Acids Research, 23, 2954—2958. 
James , B, D, and Higgins , S, J, 1985, Nucleic Acid Hybridisation (Oxford: IRL Press Ltd). 
Kas-Deelen , a. M., Harmsen , M. C, de Maar, E. F. and van Son, W. J, 1998, A sensitive method for 



f 



Differential gene expression 689 

quantifying cytomegalic endothelial cells in peripheral blood from cytomegalovirus-infected 

patients. Clinical Diagnostic and Laboratory Immunology ^ 5, 622-626. 
Kilty , I. and Vickers , P., 1997, Fractionating DNA fragments generated by differential display PGR. 

Strategies Newsletter (Stratagene), 10, 50-51. 
Kleinjan , D.-J. and van Heyningen , V., 1998, Position effect in human genetic disease. Human attd 

Molecular Genetics, 7, 1611-1618. 
Ko, M.S., 1990, An 'equalized cDNA library' by the reassociation of short double-stranded cDNAs. 

Nucleic Acids Research, 18, 5705-5711. 
Lake, B. G., Evans, J. G., Cunninghame , M. E. and Price, R, J., 1993, Comparison of the hepatic 

effects of Wy-14,643 on peroxisome proliferation and cell replication in the rat and Syrian 

hamster. Environmetital Health Perspectives, 101, 241-248. 
Lake, B. G., Evans , J. G., Gray, T. J. B., Korosi , S. A. and North , C. J., 1989, Comparative studies 

of nafenopin-induced hepatic peroxisome proliferation in the rat, Syrian hamster, guiea pig and 

marmoset. Toxicology and Applied Pharmacology*, 99, 148-160. 
Lennard, M.S., 1993, Genetically determined adverse drug reactions involving metabohsm. Drug 

Safety, 9, 60-77. 

Levy, S., Todd, S. C. and Maecker, H. T., 1998, CD81(TAPA-1): a molecule involved in signal 

transduction and cell adhesion in the immune system. Annual Review of Immunology, 16, 89-109. 
Liang , P. and Pardee, A. B., 1992, Differential display of eukaryotic messenger RNA by means of the 

polymerase chain reaction. Science, 257, 967-971. 
Liang, P., Averboukh , L., Keyomarsi , K., Sager, R. and Pardee, A., 1992, Differential display and 

cloning of messenger RNAs from human breast cancer versus mammary epithelial cells. Cancer 

Research, SI, 6966-6968. 

Liang, P., Averboukh , L. and Pardee, A. B., 1993, Distribution & cloning of eukaryotic mRNAs by 

means of differential display refinements and optimisation. Nucleic Acids Research, 21, 3269-3275. 
Liang, P., Bauer, D., Averboukh , L., Warthoe, P., Rohrwild , M., Muller, H,, Strauss, M. and 

Pardee, A. B., 1995, Analysis of altered gene expression by differential display. Methods in 

Enzymology, 254, 304-321. 
LiNSKENS , M. H., Feng, J., Andrews, W. H., Enlow^ , B. E., Saati, S. M., Tonkin , L. A., Funk, 

W. D. and Villeponteau , B., 1995, Cataloging altered gene expression in young and senescent 

cells using enhanced differential display. Nucleic Acids Research, 23, 3244-3251. 
LisrrsYN , N., Lisiitsyn , N. and Wigler , M., 1993, Cloning the differences between two complex 

genomes. Science, 259, 946-951. 
LoHMANN , J., Schickle , H. and Bosch , T. C, G., 1995, REN Display, a rapid and efficient method for 

non-radioactive differential display and mRNA isolation. Biotechniques, 18, 200-202. 
Lunney , J. K., 1998, Cytokines orchestrating the immune response. Reviews in Science and Techology, 

17, 84-94. 

Makowska, J.M., Gibson, G. G. and Bonner, F.W., 1992, Species differences in ciprofibrate- 
induction of hepaic cytochrome P4504A1 and peroxisome proliferation. Journal of Biochemical 
Toxicology, 7 , 1 83-1 91 . 

Maldarelli , F., XiANG , C, Chamoun , G. and Zeichner , S. L., 1998, The expression of the essential 
nuclear splicing factor SC35 is altered by human immunodeficiency virus infection. Virus 
Research, 53, 39-51. 

Mathieu -Daude, F., Cheng , R., Welsh , J. and McClelland , M., 1996, Screening of differentially 
amplified cDNA products from RNA arbitrarily primed PCR fingerprints using single strand 
conformation polymorphism (SSCP) gels. Nucleic Acids Research, 24, 1504-1507. 

McK enzie , D. and Drake, D., 1997, Identification of differentially expressed gene products with the 
castaway system. Strategies Newsletter (Stratagene), 10,19-20. 

McClelland, M., Mathieu -Daude, F. and Welsh, J., 1996, RNA fingerprinting and differential 
display using arbitrarily primed PCR. Trends in Geiietics, 11, 242-246. 

Mechler, B. and Rabbitts , T. H., 1981, Membrane-bound ribosomes of myeloma cells. IV. mRNA 
complexity of free and membrane-bound polysomes. Journal of Cell Biology, 88, 29-36. 

Meyer, U. A. and Zanger, U. M., 1997, Molecular mechanisms of genetic polymorphisms of drug 
metabolism. Annual Review of Pharmacology and Toxicology^ 37, 269-296. 

MoHLER , K. M. and Butler , L. D., 1991, Quantitation of cytokine mRNA levels utilizing the reverse 
transcriptase-polymerase chain reaction following primary antigen-specific sensitization in 
vivo — I. Verification of linearity, reproducibility and specificity. Molecular Immunology, 28, 
437-447. 

Murphy, L. D., Herzog, C. E., Rudick, J. B., Tito Fojo, A. and Bates, S. E., 1990, Use of the 
polymerase chain reaction in the quantitation of the mdr-1 gene expression. Biochemistry, 29, 
10351-10356. 

Nelson, D. R., Koymans , L., Kamataki , T., Stegeman , J. J., Feyereisen , R., Waxman, D. J., 
Waterman , M. R., Gotoh ,0., Coon , M. J., Estabtrook , R. W., Gunsalus , I. C. and Nebert , 
D. W., 1996, Update on new sequences, gene mapping, accession numbers and nomenclature. 
Pharmacogenetics, 6, 1—42. 



690 J. C, Rockett et al. 

NisHio , Y., AiELLO , L. P. and King , G. L., 1994, Glucose induced genes in bovine aortic smooth muscle 

cells identified by mRNA diiferential display. FASEB Journal, 8, 103-106. 
O^Neill , M. J. and Sinclair , A. H., 1997, Isolation of rare transcripts by representational difference 

analysis. Nucleic Acids Research, 25, 2681-2682. 
Orton, T. C, Adam, H. K., Bentley , M., Holloway , B. and Tucker, M. J., 1984, Clobuzarit: species 

dilferences in the morphological and biochemical response of the liver following chronic 

administration. Toxicology ami Applied Pharmacology, 73, 138-151. 
Pelkonen , O., Maenpaa, J., TAAvrrsAiNEN , P., Rautio, a. and Raunio , H., 1998, Inhibition and 

Induction of human cytochrome P450 (CYP) enzymes. Xenobiotica , 28, 1203-1253. 
Philips , S. M., Bendall , A. J. and Ramshaw , I. A., 1990, Isolation of genes associated with high 

metastatic potential in rat mammary adenocarcinomas. Joimml of the National Cancer Institute, 

82, 199-203. 

Prashar, Y. and Weissman , S. M., 1996, Analysis of differential gene expression by display of 3'end 
restriction fragments of cDNAs. Proceedings of the National Academy of Sciences {USA), 93, 
659-663. 

Ragno, S., Estrada, I., Butler, R. and Colston, M. J., 1997, Regulation of macrophage gene 

expression following invasion by Mycobacterium tuberculosis. Immunology Letters, 57, 143-146. 
Ramana , K. V. and Kohli , K. K., 1998, Gene regulation of cytochrome P450 — an overview. Indian 

Journal of Experimen tal Biology ,36, 437-446. 
Richard , L., Velasco , P. and Detmar , M., 1998, A simple immunomagnetic protocol for the selective 

isolation and long-term culture of human dermal microvascular endothelial cells. Experimental 

Cell Research, 240, 1-6. 

Rockett, J.C, Esdaile , D.J. and Gibson, G. G., 1997, Molecular profiling of non-genotoxic 

hepatocarcinogenesis using differential display reverse transcription-polymerase chain reaction 

(ddRT-PCR). European Journal of Drug. Metabolism and Pharmacokinetics , 22, 329-333. 
Rodricks, J. V. and Turnbull , D., 1987, Inter-species differences in peroxisomes and peroxisome 

proliferation. Toxicology and Industrial Health, 3, 197-212. 
Rogler, G., Hausmann , M., Vogl, D., Aschenbrenner , E., Andus, T., Falk, W., Andreesen , R., 

Scholmerich , J. and Gross, V., 1998, Isolation and phenotypic characterization of colonic 

macrophages. Clinical and Experimental Immunology, 112, 205-215. 
Rohn , W. M., Lee, Y. J. and Benveniste , E. N., 1996, Regulation of class II MHC expression. Critical 

Reviews in Immunology, 16, 311-330. 
Rudin , C. M. and Thompson , C. B., 1998, B-cell development and maturation. Seminars hi Oncology, 

25, 435^46. 

Sakaguchi, N., Berger, C, N. and Melchers , F., 1986, Isolation of a cDNA copy of an RNA species 

expressed in murine pre-B cells. EM BO Journal, 5, 2139-2147. 
Sambrook , J,, Fritsch , E. F. and Maniatis , T., 1989, Gel electrophoresis of DNA. In N. Ford, M. 

Nolan and M. Fergusen (eds), Molecular Cloning — A laboratory manual, 2nd edition (New York : 

Cold Spring Harbour Laboratory Press), Volume 1, pp. 6-37. 
Sargent, T. D. and Dawid, L B., 1983, Differential gene expression in the gastrula of Xenopus laevis. 

Science, 222, 135-139. 

Schena , M., Shalon , D., Heller , R.,Chai, A., Brown., P. O. and Davis , R. W., 1996, Parallel human 

genome analysis: Microarray-based expression monitoring of 1000 genes. Proceedings of the 

National Academy of Sciences {USA), 93, 10614-10619. 
Schneider , C, King , R. M. and Philipson , L., 1988, Genes specifically expressed at growth arrest of 

mammalian cells. Cell, 54, 787-793. 
Schnhder -Maunoury , S., Gilardi -Hebenstreit , P. and Charnay , P., 1998, How to build a vertebrate 

hindbrain. Lessons from genetics. C R Academy of Science III, 321, 819-834. 
Semenza , G. L., 1994, Transcriptional regulation of gene expression ; mechanisms and pathophysiology. 

Human Mutations, 3, 180-199. 
Sewall , C. H., Bell, D. A., Clark, G. C, Tritscher , A. M., Tully, D. B., Vanden Heuvel , J. and 

Lucier , G. W., 1995, Induced gene transcription: implications for biomarkers. Clinical 

Chemistry, 41, 1829-1834. 
Singh , N., Agrawal, S. and Rastogi , A. K., 1997, Infectious diseases and immunity: special reference 

to major histocompatibility complex. Emerging Infectious Diseases, 3, 41-49. 
Smith , N. R,, Li, A., Aldersley , M., High , A. S., Markham, A. F. and Robinson , P. A., 1997, Rapid 

determination of the complexity of cDNA bands extracted from DDRT-PCR polyacrylamide 

gels. Nucleic Acids Research, 25 , 3552-3554. 
SoMPAYRAC , L., Jane, S., Burn., T. C, Tenen ,D. G. and Danna, K. J., 1995, Overcoming limitations 

of the mRNA differential display technique. Nucleic Acids Research, 23, 4738-4739. 
St John, T. P. and Davis, R. W., 1979, Isolation of galactose-inducible DNA sequences from 

Saccharomyces cerevisiae by differential plaque filter hybridisation. Cell, 16, 443-452. 
Sun, Y., Hegamyer , G. and Colburn , N. H., 1994, Molecular cloning of five messenger RNAs 

differentially expressed in preneoplastic or neoplastic JB6 mouse epidermal cells: one is 

homologous to human tissue inhibitor of metalloproteinases-3. Cancer Research, 54, 1139-1144. 



Differential gene expression 



691 



Sung, Y. J. and Denman , R. B., 1997, Use of two reverse transcriptases eliminates false -positive results 

in diiferential display. Biotechniques^ 23, 462-464. 
Sutton, G., White, O., Adams, M. and K er lavage , A., 1995, TIGR Assembler; A new tool for 

assembling large shotgun sequencing projects. Genome Science and Technology y 1, 9-19. 
Suzuki, Y., Sekiya , T. and Hayashi, K., 1991, Allele-specific polymerase chain reaction: a method for 

amplification and sequence determination of a single component among a mixture of sequence 

variants. Analytical Biochemistry, 192, 82-84. 
Syed, v., Gu, W. and Hecht , N. B., 1997, Sertoli cells in culture and mRNA differential display provide 

a sensitive early warning assay system to detect changes induced by xenobiotics. Journal of 

Andrology, 18, 264-273. 

UiTTERUNDEN , A. G., Slagboom , P., Knook, D. L, and VuGL, J., 1989, Two-dimensional DNA 
fingerprinting of human individuals. Proceedings of the National Academy of Sciences {USA)y 86, 
2742-2746. 

U llman , K. S., Northrop , J. P., Verweij , C. L». and Crabtree , G. R., 1990, Transmission of signals 

from the T lymphocyte antigen receptor to the genes responsible for cell proliferation and immune 

function: the missing link. Annual Review of Immunology ^ 8, 421-452. 
Vasmatzis , G., EssAND, M., Brinkmann , U., Lee, B. and Paston, I., 1998, Discovery of three genes 

specifically expressed in human prostate by expressed sequence tag database analysis. Proceedings 

of the National Academy of Sciences {USA), 95, 300-304. 
Velculescu , V. E., Zhang, L., Vogelstein , B. and Kinzler , K. W., 1995, Serial analysis of gene 

expression. Science, 270, 484—487. 
VoELTZ, G, K. and Steitz , J. A., 1998, AuuuA sequences direct mRNA deadenylation uncoupled from 

decay during Xenopus early development. Molecular and Cell Biology, 18, 7537-7545. 
Vogelstein , B. and Kinzler, K. W., 1993, The multistep nature of cancer. Trends in Genetics, 9, 

138-141, 

Walter , J., Belfield , M., Hampson , I. and Read, C, 1997, A novel approach for generating subtractive 

probes for differential screening by COLS. Life Science News, 21, 13-14. 
Wan, J. S., Sharp, S. J., Poirier , G. M.-C, Wagaman , P. C, Chambers , J., Pyati, J., Hom , Y.-L., 

Galindo , J.E., HuvAR, A., Peterson, P. A., Jackson, M. R. and Erlander , M. G., 1996, 

Cloning differentially expressed mRNAs. Nature Biotechnology, 14, 1685-1691. 
Walter , J., Belfield , M., Hampson , L and Read, C, 1997, A novel approach for generating subtractive 

probes for differential screening by CCLS, Life Science News, 21, 13-14. 
Wang, Z. and Brown, D. D. 1991, A gene expression screen. Proceedings of the National Academy of 

Sciences {USA), 88, 11505-11509. 
Wawer, C, Ruggeberg , H., Meyer, G. and Muyzer, G., 1995, A simple and rapid electrophoresis 

method to detect sequence variation in PCR-amplified DNA fragments. Nucleic Acids Research, 

23, 4928-4929. 

Welsh , J., Chada, K., Dalal, S, S., Cheng, R., Ralph, D. and McClelland , M., 1992, Arbitrarily 
primed PCR fingerprinting of RNA. Nucleic Acids Research, 20, 4965-4970. 

Wong, H., Anderson , W. D., Cheng , T. and Riabowol , K. T., 1994, Monitoring mRNA expression 
by polymerase chain reaction: the 'primer-dropping' method. Analytical Biochemistry, 223, 
251-258. 

Wong, K. K. and McClelland , M., 1994, Stress-inducible gene of Salmonella typhimurium identified 
by arbitrarily primed PCR of RNA. Proceedings of the National Academy of Sciences {USA), 91, 
639-643. 

Wynford -Thomas , D., 1991, Oncogenes and anti-oncogenes ; the molecular basis of tumour behaviour. 

Journal of Pathology, 165, 187-201. 
Xhu, D., Chan, W. L., Leung , B. P., Huang, F. P., Wheeler , R., Pedrafita , D., Robinson , J. H. and 

Liew , F. Y., 1998, Selective expression of a stable cell surface molecule on type 2 but not type 1 

helper T cells. Journal of Experimental Medicine, 187, 787-794 . 
Yang, M. and Sytowski , A. J., 1996, Cloning differentially expressed genes by linker capture 

subtraction. Analytical Biochemistry, 237, 109-114. 
Zhao, N., Hashida , H., Takahashi , N., Misumi , Y. and Sakaki, Y., 1995, High-density cDNA filter 

analysis: a novel approach for large scale quantitative analysis of gene expression. Gette, 156, 

207-213. 

Zhao,X. J.,Newsome , J. T. and Cihlar , R. L., 1998, Up-regulation oitvjo Candida albicans genes in the 
rat model of oral candidiasis detected by differential display. Microbial Pathogenesis, 25, 121-129. 

ZiMMERMANN , C. R., Orr, W.C, Leclerc , R.F., Barnard, C. and Timberlake , W. E., 1980, 
Molecular cloning and selection of genes regulated in Aspergillus development. Cell, 21, 709-715. 



Reference 13 of 20 

with Response dated 05/04/04 

In USSN: 09/857,826 




MOLECULAR CARCINOGENESIS 24:153-159 (1999) 

IN PERSPECTIVE 

Claudio J. Conti, Editor 

Microarrays and Toxicology: The Advent of 
Toxicogenomics 

Emile F. Nuwaysir,¹ Michael Bittner,² Jeffrey Trent,² J. Carl Barrett,¹ and Cynthia A. Afshari¹

¹Laboratory of Molecular Carcinogenesis, National Institute of Environmental Health Sciences, Research Triangle Park,
North Carolina

²Laboratory of Cancer Genetics, National Human Genome Research Institute, Bethesda, Maryland

The availability of genome-scale DNA sequence information and reagents has radically altered life-science 
research. This revolution has led to the development of a new scientific subdiscipline derived from a combina- 
tion of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the 
identification of potential human and environmental toxicants, and their putative mechanisms of action, through 
the use of genomics resources. One such resource is DNA microarrays or "chips," which allow the monitoring of 
the expression levels of thousands of genes simultaneously. Here we propose a general method by which gene
expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative marker for 
toxicity. Our purpose is to acquaint the reader with the development and current state of microarray technology
and to present our view of the usefulness of microarrays to the field of toxicology. Mol. Carcinog. 24:153-159,
1999. © 1999 Wiley-Liss, Inc.

Key words: toxicology; gene expression; animal bioassay 




INTRODUCTION 

Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences,
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cerevisiae [4].
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNase protection assays, S1 nuclease
analysis, plaque hybridization, and slot blots
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies. 



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS

cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid sup- 
port. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate
[13,14]. Sample detection for microarrays on glass
involves the use of probes labeled with fluorescent
or radioactive nucleotides.

*Correspondence to: Laboratory of Molecular Carcinogenesis,
National Institute of Environmental Health Sciences, 111 Alexander
Drive, Research Triangle Park, NC 27709.

Received 8 December 1998; Accepted 5 January 1999

Abbreviations: PAH, polycyclic aromatic hydrocarbon; NIEHS, National
Institute of Environmental Health Sciences.

Fluorescent cDNA probes are generated from control
and test RNA samples in single-round reverse-transcription
reactions in the presence of fluorescently
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which
produces control and test products labeled with dif- 
ferent fluors. The cDNAs generated from these two 
populations, collectively termed the "probe," are then 
mixed and hybridized to the array under a glass cov- 
erslip [10,11,15]. The fluorescent signal is detected 
by using a custom-designed scanning confocal mi- 
croscope equipped with a motorized stage and lasers 
for fluor excitation [10,11,15]. The data are analyzed 
with custom digital image analysis software that de- 
termines for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human in- 
flammatory disease-related genes [20]. The most dra- 
matic result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cerevisiae [21].
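
The two-color ratio measurement described above can be sketched in a few lines. This is only an illustration, not the image analysis software cited in [16,17]: the function name, the intensity floor, the log2 transform, and the example intensities are all assumptions introduced here.

```python
import math


def background_corrected_log_ratio(test_fg, test_bg, control_fg, control_bg, floor=1.0):
    """Return log2(test / control) for one spot after local background subtraction.

    `floor` (an assumption of this sketch) keeps the corrected signal positive
    so that the ratio and its logarithm are always defined.
    """
    test = max(test_fg - test_bg, floor)
    control = max(control_fg - control_bg, floor)
    return math.log2(test / control)


# Hypothetical spot intensities:
# (test foreground, test background, control foreground, control background)
spots = {
    "gene_A": (5200.0, 200.0, 1450.0, 200.0),  # corrected 5000 vs 1250
    "gene_B": (900.0, 100.0, 900.0, 100.0),    # corrected 800 vs 800
    "gene_C": (350.0, 100.0, 2100.0, 100.0),   # corrected 250 vs 2000
}

ratios = {name: background_corrected_log_ratio(*values) for name, values in spots.items()}

for name, value in ratios.items():
    print(f"{name}: log2 ratio = {value:+.2f}")
```

Because both labeled populations are hybridized to the same spot, the ratio is taken within a single feature, which is why no between-array hybridization control is needed in this scheme.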

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al- 
beit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two dif- 
ferent membranes are used simultaneously for con- 
trol and test RNA hybridizations, or a single 
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27]. 

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithography
[28-30]. The strength of this approach lies in
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the applica- 
tion of this method to the fields of medical diagnos- 



tics, pharmacogenetics, and sequencing by hybrid- 
ization as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high-intensity 
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to syn- 
thesize a vast number of unique oligos, the total num- 
ber of which is limited only by the complexity of the 
photolithographic mask and the chip size [29,31,32].
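
The 4n bound can be illustrated with a small simulation. It is a sketch under stated assumptions, not a model of any real fabrication process: the function name and the fixed A, C, G, T coupling order are choices made here, and a "mask" is represented simply as the set of features whose next required base matches the nucleotide offered in that step.

```python
BASES = "ACGT"


def synthesize(targets):
    """Simulate masked, stepwise synthesis of many oligos in parallel.

    Each coupling step offers one nucleotide to the whole chip; only the
    features exposed by that step's mask (those whose next required base
    is the offered one) are extended. Returns the grown oligos and the
    number of coupling steps used.
    """
    n = max(len(t) for t in targets)
    grown = ["" for _ in targets]
    steps = 0
    for _ in range(n):          # at most n rounds ...
        for base in BASES:      # ... of four coupling steps each
            if grown == list(targets):
                return grown, steps
            steps += 1
            for i, target in enumerate(targets):
                pos = len(grown[i])
                if pos < len(target) and target[pos] == base:
                    grown[i] += base
    return grown, steps


targets = ["ACGT", "TTTT", "GATC", "CCCA"]
grown, steps = synthesize(targets)
print(f"{len(targets)} features of length 4 completed in {steps} steps (bound: {4 * 4})")
```

Every round of four steps extends each unfinished feature by at least one base, so n rounds always suffice. The step count therefore depends on oligo length rather than on how many distinct features are on the chip.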

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin) 
after hybridization [12,33]. The signal is detected with
a custom confocal scanner [34]. This method has
been applied successfully to the mapping of genomic
library clones [35], to de novo sequencing by hybridization
[28,36], and to evolutionary sequence comparison
of the BRCA1 gene [37]. In addition,
mutations in the cystic fibrosis [38] and BRCA1 [39]
gene products and polymorphisms in the human immunodeficiency
virus-1 clade B protease gene [40]
have been detected by this method. Oligonucleotide
chips are also useful for expression monitoring [33],
as has been demonstrated by the simultaneous evaluation
of gene-expression patterns in nearly all open
reading frames of the yeast S. cerevisiae [12].
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a 






far more sensitive, characteristic, and measurable 
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Cells are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres- 
sion changes are assessed by hybridization to a cDNA 
microarray chip (Figure 1). We have developed a custom
DNA chip, called ToxChip v1.0, specifically for
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxi- 
cants, termed a toxicant signature, is determined. 

This signature is derived by ranking across all experiments
the gene-expression data based on relative
fold induction or suppression of genes in treated
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi- 
lar structural isomers in a combinatorial chemistry 
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 
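The signature-and-match idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes each profile is a mapping from gene to log2(treated/control) ratio, treats a gene as part of a signature only if it changes in the same direction by at least a cutoff in every profile of the class, and scores an unknown by the fraction of signature genes it reproduces. The gene names, ratios, cutoff, and function names are all hypothetical.

```python
def signature(profiles, min_abs_log2=1.0):
    """Genes changed consistently (same direction, >= cutoff) in every profile."""
    sig = {}
    for gene in profiles[0]:
        values = [p[gene] for p in profiles]
        if all(v >= min_abs_log2 for v in values):
            sig[gene] = 1        # consistently induced
        elif all(v <= -min_abs_log2 for v in values):
            sig[gene] = -1       # consistently suppressed
    return sig

def match_score(profile, sig, min_abs_log2=1.0):
    """Fraction of signature genes changed in the expected direction."""
    if not sig:
        return 0.0
    hits = sum(1 for gene, direction in sig.items()
               if direction * profile.get(gene, 0.0) >= min_abs_log2)
    return hits / len(sig)

# Hypothetical log2 ratios for two known toxicant classes.
oxidant_profiles = [
    {"g1": 2.1, "g2": -1.5, "g3": 0.2, "g4": 0.1},
    {"g1": 1.4, "g2": -2.0, "g3": -0.3, "g4": 1.2},
]
pah_profiles = [
    {"g1": 0.1, "g2": 0.0, "g3": 2.5, "g4": 0.3},
    {"g1": -0.2, "g2": 0.4, "g3": 1.8, "g4": -0.1},
]
signatures = {"oxidant": signature(oxidant_profiles),
              "PAH": signature(pah_profiles)}

# Profile of an uncharacterized compound in the same model system.
unknown = {"g1": 1.7, "g2": -1.1, "g3": 0.4, "g4": 0.0}
scores = {name: match_score(unknown, sig) for name, sig in signatures.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A compound with partial scores against two signatures would correspond to the cross talk case mentioned in the text. For large, highly similar compound sets, this all-or-nothing rule would presumably give way to the clustering or principal-component methods the authors name.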

For the studies outlined in Figure 2, we developed 
the custom cDNA microarray chip ToxChip v1.0.

[Figure 1 labels: Treated Population; RNA Isolation; Reverse Transcription; Mix cDNAs and Apply to Array; DNA "Chip"; Hybridize Under Coverslip]

Figure 1. Simplified overview of the method for sample
preparation and hybridization to cDNA microarrays. For illustrative
purposes, samples derived from cell cultures are depicted,
although other sample types are amenable to this analysis.






[Figure 2 labels: Known Agents (Oxidant Stressors, Polycyclic Aromatic Hydrocarbons, Peroxisome Proliferators); Group A, Group B, Group C; Toxicant Signature; Suspected Toxicant; No Match, No Match, Match]

Figure 2. Schematic representation of the method for identification
of a toxicant's mechanism of action. In this method,
gene-expression data derived from exposure of model systems
to known toxicants are analyzed, and a set of changes
characteristic to that type of toxicant (termed the toxicant
signature) is identified. As depicted, oxidant stressors produce
consistent changes in group A genes (indicated by red and
green circles), but not group B or C genes (indicated by gray
circles). The set of gene-expression changes elicited by the
suspected toxicant is then compared with these characteristic
patterns, and a putative mechanism of action is assigned to
the unknown agent.



The 2090 human genes that comprise this subarray
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators, 
estrogenic compounds, and oxidant stress. Some of 
the other categories of genes include transcription 
factors, oncogenes, tumor suppressor genes, cyclins, 
kinases, phosphatases, cell adhesion and motility 
genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridiza- 
tion intensity is averaged and used for signal nor- 
malization of the other genes on the chip. To date, 
very few toxicants have been shown to have appre- 
ciable effects on the expression of these housekeep- 
ing genes. However, this housekeeping list will be 
revised if new data warrant the addition or deletion 
of a particular gene. Table 1 contains a general de- 
scription of some of the different classes of genes 
that comprise ToxChip v1.0.
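The housekeeping normalization described in this paragraph amounts to dividing every signal by the mean housekeeping intensity. The sketch below assumes raw intensities are held in a plain dictionary; the gene labels, intensity values, and the function name `normalize` are hypothetical and only illustrate the arithmetic.

```python
def normalize(intensities, housekeeping):
    """Scale all intensities by the mean intensity of the housekeeping genes."""
    hk_values = [intensities[g] for g in housekeeping]
    scale = sum(hk_values) / len(hk_values)
    return {gene: value / scale for gene, value in intensities.items()}

# Hypothetical raw hybridization intensities for one array.
raw = {"HK1": 900.0, "HK2": 1100.0, "GENE_X": 500.0, "GENE_Y": 4000.0}
norm = normalize(raw, housekeeping=["HK1", "HK2"])
print(norm)
```

After scaling, the housekeeping genes average 1.0 on every array, so values from different hybridizations sit on a common scale. The approach is only as good as the assumption that the housekeeping set itself is unaffected by treatment, which is why the authors note that the list may be revised.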

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures 



are displayed [11]. This facilitates rapid, visual in- 
terpretation of data. We are also developing Tox- 
Chip v2.0 and chips for other model systems, 
including rat, mouse, Xenopus, and yeast, for use in 
toxicology studies. 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology test- 
ing. Unfortunately, these assays are inherently ex- 
pensive, require large numbers of animals and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxic- 
ity, immunotoxicity, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown 






Table 1. ToxChip v1.0: A Human cDNA Microarray
Chip Designed to Detect Responses to Toxic Insult

Gene category                             No. of genes on chip

Apoptosis                                  72
DNA replication and repair                 99
Oxidative stress/redox homeostasis         90
Peroxisome proliferator responsive         22
Dioxin/PAH responsive                      12
Estrogen responsive                        63
Housekeeping                               84
Oncogenes and tumor suppressor genes       76
Cell-cycle control                         51
Transcription factors                     131
Kinases                                   276
Phosphatases                               88
Heat-shock proteins                        23
Receptors                                 349
Cytochrome P450s                           30

*This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.

agent is (or is not) responsible for eliciting a given 
biological response. This information would help to 
select a bioassay more specifically suited to the agent 
in question or perhaps suggest that a bioassay is not 
necessary, which would dramatically reduce cost,
animal use, and time. 

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxi- 
cology and plant pathology. Bioassays based on the 
flathead minnow, Daphnia, and Arabidopsis could




also be improved by the addition of microarray analy- 
sis. The combination of microarrays with traditional 
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring,
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed
to PAHs (and many other compounds) is under con- 
sideration at the NIEHS. An important consideration 
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene-
expression database are currently under way [44,45].
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression. 
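The two comparison designs discussed above, before versus after exposure in the same individual and pooled controls versus pooled exposed samples, can be sketched as follows. This is an illustration under stated assumptions: normalized intensities are held in dictionaries, change is expressed as a log2 ratio, and pooling is modeled as a simple mean across individuals. The gene labels, values, and function names are hypothetical.

```python
import math
from statistics import mean

def paired_log2_change(before, after):
    """Per-gene log2(after/before) for samples from the same individual."""
    return {gene: math.log2(after[gene] / before[gene]) for gene in before}

def pooled_log2_change(control_samples, exposed_samples):
    """Per-gene log2 ratio of pool means; averaging damps individual variation."""
    genes = control_samples[0].keys()
    return {
        gene: math.log2(mean(s[gene] for s in exposed_samples)
                        / mean(s[gene] for s in control_samples))
        for gene in genes
    }

# Hypothetical normalized intensities from three control and three exposed donors.
controls = [{"gene_1": 100, "gene_2": 1000},
            {"gene_1": 140, "gene_2": 900},
            {"gene_1": 60, "gene_2": 1100}]
exposed = [{"gene_1": 380, "gene_2": 950},
           {"gene_1": 400, "gene_2": 1050},
           {"gene_1": 420, "gene_2": 1000}]

pooled = pooled_log2_change(controls, exposed)
paired = paired_log2_change({"gene_1": 50}, {"gene_1": 200})
print(pooled, paired)
```

In this toy data the control values for gene_1 vary more than twofold between donors, yet the pooled ratio still reports a clear induction, which is the motivation the text gives for comparing pools when diet, health, and personal habits confound individual samples.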






Alleles, Oligo Arrays, and Toxicogenetics 

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area 
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports dem- 
onstrated the feasibility of this approach [41,42]. 
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility. 

FUTURE PRIORITIES 

There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human 
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in develop- 
ing the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 

The proliferation of different platforms and meth- 
ods for microarray hybridizations will improve 
sample handling and data collection and analysis and 
reduce costs. However, the variety of microarray 
methods available will create problems of data com- 
patibility between platforms. In addition, the near- 
infinite variety of experimental conditions under 



which data will be collected by different laborato- 
ries will make large-scale data analysis extremely dif- 
ficult. To help circumvent these future problems, a 
set of standards to be included on all platforms 
should be established. These standards would facili- 
tate data entry into the national database and serve 
as reference points for cross-platform and inter-labo- 
ratory data analysis. 

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray 
hybridization will have a dramatic impact on toxicol- 
ogy research. In the future, the information gathered 
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and environmental 
health.

ACKNOWLEDGMENTS 

The authors would like to thank Drs. Robert 
Maronpot, George Lucier, Scott Masten, Nigel 
Walker, Raymond Tennant, and Ms. Theodora
Devereux for critical review of this manuscript. EFN
was supported in part by NIEHS Training Grant
#ES07017-24.

REFERENCES 

1. http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html

2. http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html

3. Fleischmann RD, Adams MD, White O, et al. Whole-genome ran- 
dom sequencing and assembly of Haemophilus influenzae Rd. 
Science 1995;269:496-512. 

4. Goffeau A, Barrell BG, Bussey H, et al. Life with 6000 genes. 
Science 1996;274:546, 563-567. 

5. http://www.perkin-elmer.com/press/prc5448.html

6. Liang P, Pardee AB. Differential display of eukaryotic messenger 
RNA by means of the polymerase chain reaction. Science 
1992;257:967-971. 

7. Pietu G, Alibert O, Guichard V, et al. Novel gene transcripts pref- 
erentially expressed in human muscles revealed by quantitative 
hybridization of a high density cDNA array. Genome Res 
1996;6:492-503. 

8. Zhao ND, Hashida H, Takahashi N, Misumi Y, Sakaki Y. High-den- 
sity cDNA filter analysis — A novel approach for large-scale, quan- 
titative analysis of gene expression. Gene 1995;156:207-213. 

9. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis 
of gene expression. Science 1995;270:484-487. 

10. Schena M, Shalon D, Davis RW, Brown PO. Quantitative moni- 
toring of gene-expression patterns with a complementary DNA 
microarray. Science 1995;270:467-470. 

11. DeRisi J, Penland L, Brown PO, et al. Use of a cDNA microarray to
analyse gene expression patterns in human cancer. Nat Genet 
1996;14:457-460. 

12. Wodicka L, Dong HL, Mittmann M, Ho MH, Lockhart DJ. Genome-wide
expression monitoring in Saccharomyces cerevisiae.
Nat Biotechnol 1997;15:1359-1367. 

13. Marshall A, Hodgson J. DNA chips: An array of possibilities. Nat 
Biotechnol 1998;16:27-31. 

14. http://www.synteni.com 

15. Shalon D, Smith SJ, Brown PO. A DNA microarray system for 
analyzing complex DNA samples using two-color fluorescent 
probe hybridization. Genome Res 1996;6:639-645. 

16. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and
the quantitative analysis of cDNA microarray images. Biomedical 
Optics 1997;2:364-374. 

17. Khan J, Simon R, Bittner M, et al. Gene expression profiling of
alveolar rhabdomyosarcoma with cDNA microarrays. Cancer Res 
1998;58:5009-5013. 

18. Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW.
Parallel human genome analysis: Microarray-based expression 
monitoring of 1000 genes. Proc Natl Acad Sci USA 1996; 
93:10614-10619. 






19. Lashkari DA, DeRisi JL, McCusker JH, et al. Yeast microarrays for
genome wide parallel genetic and gene expression analysis. Proc 
Natl Acad Sci USA 1997;94:13057-13062. 

20. Heller RA, Schena M, Chai A, et al. Discovery and analysis of 
inflammatory disease-related genes using cDNA microarrays. Proc 
Natl Acad Sci USA 1997;94:2150-2155. 

21. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and ge- 
netic control of gene expression on a genomic scale. Science 
1997;278:680-686. 

22. Drmanac S, Stavropoulos NA, Labat I, et al. Gene-representing 
cDNA clusters defined by hybridization of 57,419 clones from 
infant brain libraries with short oligonucleotide probes. Genomics 
1996;37:29-40. 

23. Milosavljevic A, Savkovic S, Crkvenjakov R, et al. DNA sequence 
recognition by hybridization to short oligomers: Experimental 
verification of the method on the E. coli genome. Genomics 
1996;37:77-86. 

24. Drmanac S, Drmanac R. Processing of cDNA and genomic 
kilobase-size clones for massive screening, mapping and se- 
quencing by hybridization. Biotechniques 1994;17:328-329, 
332-336. 

25. http://www.resgen.com/

26. http://www.genomesystems.com/

27. http://www.clontech.com/

28. Pease AC, Solas DA, Fodor SPA. Parallel synthesis of spatially 
addressable oligonucleotide probe matrices. Abstract. Abstracts 
of Papers of the American Chemical Society 1992;203:34. 

29. Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor 
SPA. Light-generated oligonucleotide arrays for rapid DNA 
sequence analysis. Proc Natl Acad Sci USA 1994;91:5022- 
5026. 

30. Fodor SPA, Read JL, Pirrung MC, Stryer L, Lu AT, Solas D. Light- 
directed, spatially addressable parallel chemical synthesis. Sci- 
ence 1991;251:767-773. 

31. McGall G, Labadie J, Brock P, Wallraff G, Nguyen T, Hinsberg W.
Light-directed synthesis of high-density oligonucleotide arrays 
using semiconductor photoresists. Proc Natl Acad Sci USA 
1996;93:13555-13560. 

32. Lipshutz RJ, Morris D, Chee M, et al. Using oligonucleotide probe 
arrays to access genetic diversity. Biotechniques 1995;19:442-447.

33. Lockhart DJ, Dong HL, Byrne MC, et al. Expression monitoring
by hybridization to high-density oligonucleotide arrays. Nat
Biotechnol 1996;14:1675-1680.

34. http://www.mdyn.com/ 

35. Sapolsky RJ, Lipshutz RJ. Mapping genomic library clones using 
oligonucleotide arrays. Genomics 1996;33:445-456. 

36. Chee M, Yang R, Hubbell E, et al. Accessing genetic information 
with high-density DNA arrays. Science 1996;274:610-614. 

37. Hacia JG, Makalowski W, Edgemon K, et al. Evolutionary se- 
quence comparisons using high-density oligonucleotide arrays. 
Nat Genet 1998;18:155-158. 

38. Cronin MT, Fucini RV, Kim SM, Masino RS, Wespi RM, Miyada 
CG. Cystic fibrosis mutation detection by hybridization to light- 
generated DNA probe arrays. Hum Mutat 1996;7:244-255. 

39. Hacia JG, Brody LC, Chee MS, Fodor SPA, Collins FS. Detection 
of heterozygous mutations in BRCA1 using high density oligo- 
nucleotide arrays and two-colour fluorescence analysis. Nat Genet 
1996;14:441-447. 

40. Kozal MJ, Shah N, Shen NP, et al. Extensive polymorphisms observed
in HIV-1 clade B protease gene using high-density oligonucleotide
arrays. Nat Med 1996;2:753-759.

41. Wang DG, Fan JB, Siao CJ, et al. Large-scale identification, mapping,
and genotyping of single-nucleotide polymorphisms in the
human genome. Science 1998;280:1077-1082. 

42. Winzeler EA, Richards DR, Conway AR, et al. Direct allelic variation
scanning of the yeast genome. Science 1998;281:1194-1197.

43. Chhabra RS, Huff JE, Schwetz BS, Selkirk J. An overview of
prechronic and chronic toxicity carcinogenicity experimental-study 
designs and criteria used by the National Toxicology Program. 
Environ Health Perspect 1990;86:313-321. 

44. Ermolaeva O, Rastogi M, Pruitt KD, et al. Data management and
analysis for gene expression arrays. Nat Genet 1998;20:19-23. 

45. http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/dbase.html

46. Samson M, Libert F, Doranz BJ, et al. Resistance to HIV-1 infec- 
tion in Caucasian individuals bearing mutant alleles of the CCR- 
5 chemokine receptor gene. Nature 1996;382:722-725. 

47. Bell DA, Taylor JA, Paulson DF, Robertson CN, Mohler JL, Lucier
GW. Genetic risk and carcinogen exposure: A common inherited
defect of the carcinogen-metabolism gene glutathione-S-transferase
M1 (GSTM1) that increases susceptibility to bladder
cancer. J Natl Cancer Inst 1993;85:1159-1164.

48. http://www.niehs.nih.gov/envgenom/home.html




Subject: RE: [Fwd: Toxicology Chip]
Date: Mon, 3 Jul 2000 08:09:45 -0400
From: "Afshari.Cynthia" <afshari@niehs.nih.gov>
To: "'Diana Hamlet-Cox'" <dianahc@incyte.com>

You can see the list of clones that we have on our 12K chip at
http://manuel.niehs.nih.gov/maps/guest/clonesrch.cfm

We selected a subset of genes (2000K) that we believed critical to tox
response and basic cellular processes and added a set of clones and ESTs to
this. We have included a set of control genes (80+) that were selected by
the NHGRI because they did not change across a large set of array
experiments. However, we have found that some of these genes change
significantly after tox treatments and are in the process of looking at the
variation of each of these 80+ genes across our experiments.
Our chips are constantly changing and being updated and we hope that our
data will lead us to what the toxchip should really be.
I hope this answers your question.
Cindy Afshari



> From: Diana Hamlet-Cox
> Sent: Monday, June 26, 2000 8:52 PM
> To: afshari@niehs.nih.gov
> Subject: [Fwd: Toxicology Chip]
>
> Dear Dr. Afshari,
>
> Since I have not yet had a response from Bill Grigg, perhaps he was not
> the right person to contact.
>
> Can you help me in this matter? I don't need to know the sequences,
> necessarily, but I would like very much to know what types of sequences
> are being used, e.g., GPCRs (more specific?), ion channels, etc.
>
> Diana Hamlet-Cox
>
> Original Message
> Subject: Toxicology Chip
> Date: Mon, 19 Jun 2000 18:31:48 -0700
> From: Diana Hamlet-Cox <dianahc@incyte.com>
> Organization: Incyte Pharmaceuticals
> To: grigg@niehs.nih.gov
>
> Dear Colleague:
>
> I am doing literature research on the use of expressed genes as
> pharmacotoxicology markers, and found the Press Release dated February
> 29, 2000 regarding the work of the NIEHS in this area. I would like to
> know if there is a resource I can access (or you could provide?) that
> would give me a list of the 12,000 genes that are on your Human ToxChip
> Microarray. In particular, I am interested in the criteria used to
> select sequences for the ToxChip, including any control sequences
> included in the microarray.
>
> Thank you for your assistance in this request.
>
> Diana Hamlet-Cox, Ph.D.
> Incyte Genomics, Inc.
>
> --
>



Reference 15 of 20 
with Response dated 05/04/04 
In USSN: 09/857,826 






> This email message is for the sole use of the intended recipient(s) and
> may contain confidential and privileged information subject to
> attorney-client privilege. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of the
> original message.






07/31/2000 >0:34 AM 



research focus 



Reference 16 of 20

with Response dated 05/04/04 

In USSN: 09/857,826 



Proteomics: a major new 
technology for the drug 
discovery process 

Martin J. Page, Bob Amess, Christian Rohlff, Colin Stubberfield 
and Raj Parekh 



Proteomics is a new enabling technology that is being 
integrated into the drug discovery process. This will 
facilitate the systematic analysis of proteins across any 
biological system or disease, forwarding new targets 
and information on mode of action, toxicology and sur- 
rogate markers. Proteomics is highly complementary to 
genomic approaches in the drug discovery process and, 
for the first time, offers scientists the ability to integrate 
information from the genome, expressed mRNAs, their 
respective proteins and subcellular localization. It is ex- 
pected that this will lead to important new insights into 
disease mechanisms and improved drug discovery 
strategies to produce novel therapeutics. 



Among the major pharmaceutical and biotechnol- 
ogy companies, it is clearly recognized that the 
business of modern drug discovery is a highly 
competitive process. All of the many steps in- 
volved are inherently complex, and each can involve a 
high risk of attrition. The players in this business strive 
continuously to optimize and streamline the process; each 
seeking to gain an advantage at every step by attempting 
to make informed decisions at the earliest stage possible. 
The desired outcome is to accelerate as many key activities 
in the drug discovery process as possible. This should produce
a new generation of robust drugs that offer a high
probability of success and reach the clinic and market
ahead of the competition.

There has been noticeable emphasis over recent years 
for companies to aggressively review and refine their 
strategies to discover new drugs. Central to this has been 
the introduction and implementation of cutting-edge 
technologies. Most, if not all, companies have now inte- 
grated key technology platforms that incorporate gen- 
omics, mRNA expression analysis, relational databases, 
high-throughput robotics, combinatorial chemistry and 
powerful bioinformatics. Although it is still early days to 
quantify the real impact of these platforms in clinical and 
commercial terms, expectations are high, and it is widely 
accepted that significant benefits will be forthcoming. This 
is largely based on data obtained during preclinical studies 
where the genomic (Refs 1,2) and microarray (Refs 3,4) technologies have
already proved their value. 

However, there are several noteworthy outcomes that re- 
sult from this. Many comments are voiced that scientists 
armed with these technologies are now commonly faced 
with data overload. Thus, in some instances, rather than 
facilitating the decision process, the accumulation of more 
complex data points, many with unknown consequences, 
can seem to hinder the process. Also, most drug compa- 
nies have simultaneously incorporated very similar compo- 
nents of the new technology platforms, the consequence 
being that it is becoming difficult yet again to determine 
where a clear competitive advantage will arise. Finally, in 
recent years, largely as a result of the accessibility of the 
technologies, there has been an overwhelming emphasis 
placed on genomic and mRNA data rather than on protein 



Martin J. Page*, Bob Amess, Christian Rohlff, Colin Stubberfield and Raj Parekh, Oxford GlycoSciences, 10 The Quadrant,
Abingdon Science Park, Abingdon, Oxfordshire, UK OX14 3YS. *tel: +44 1235 543277, fax: +44 1235 543283,
e-mail: martin.page@ogs.co.uk



DDT Vol. 4, No. 2 February 1999

1359-6446/99/$ - see front matter © Elsevier Science. All rights reserved. PII: S1359-6446(98)01291-4



[Figure 1 panels: Sample; 2D gels and imaging; Curation and interrogation (Composite - normal, Composite - disease); Differential analysis (Proteograph™; MCI, fold increase, fold decrease); Mass spectrometry and annotation (Abundance, %)]

Figure 1. Steps involved in analysing a biological sample by proteomics. MCI, molecular cluster index.



analysis. It is important to remember that proteins dictate 
biological phenotype - whether it is normal or diseased - 
and are the direct targets for most drugs. 

Proteomics: new technology for 
the analysis of proteins 

It is now timely to recognize that complementary technol- 
ogy in the form of high-throughput analysis of the total 
protein repertoire of chosen biological samples, namely 
proteomics, is poised to add a new and important dimen- 
sion to drug discovery. In a similar fashion to genomics, 
which aims to profile every gene expressed in a cell, proteomics
seeks to profile every protein that is expressed (Refs 5-7).
However, there is added information, since proteomics can
also be used to identify the post-translational modifications
of proteins (Ref. 8), which can have profound effects on biological
function, and their cellular localization. Importantly,
proteomics is a technology that integrates the significant
advances in two-dimensional (2D) electrophoretic separation
of proteins, mass spectrometry and bioinformatics.
With these advances it is now possible to consistently derive
proteomes that are highly reproducible and suitable
for interrogation using advanced bioinformatic tools.

There are many variations whereby different laboratories 
operate proteomics. For the purpose of this review, the
process used at Oxford GlycoSciences (OGS), which uses
an industrial-scale operation that is integral to its drug discovery
work, will be described. The individual steps of
this process, where up to 1000 2D gels can be run and
analysed per week, are summarized in Fig. 1. The incoming
samples are bar coded and all information relevant to
the sample is logged into a Laboratory Information
Management System (LIMS) database. There can be a wide
range in the type of samples processed, as applicable to
individual steps in the drug discovery pipeline, and these
will be mentioned later. The samples are separated according
to their charge (pI) in the first dimension, using isoelectric
focusing, followed by size (MW) using SDS-PAGE
in the second dimension. Many modifications have been
made to these steps to improve handling, throughput and
reproducibility. The separated proteins are then stained
with fluorescent dyes which are significantly more sensitive
in detection than standard silver methods and have a
broader dynamic range. The image of the displayed proteins
obtained is referred to as the proteome, and is digitally
scanned into databases using proprietary software
called ROSETTA™. The images are subsequently curated,
which begins with the removal of any artefacts, cropping
and the placement of pI/MW landmarks. The images from
replicate images are then aligned and matched to one







another to generate a synthetic composite image. This is 
an important step, as the proteome is a dynamic situation, 
and it captures the biological variation that occurs, such 
that even orphan proteins are still incorporated into the 
analysis. 
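
The compositing step described above can be illustrated with a minimal sketch. It assumes the replicate gels have already been aligned and matched, so that each feature carries a shared identifier; the function name `build_composite`, the feature IDs and the abundance values are hypothetical and are not taken from the ROSETTA software.

```python
from collections import defaultdict
from statistics import mean


def build_composite(replicates):
    """Merge replicate gels into one synthetic composite.

    replicates: list of dicts mapping a feature ID to the abundance
    measured on one gel (features are assumed to be matched already).
    Returns {feature_id: (mean abundance where seen, number of gels seen)}.
    """
    seen = defaultdict(list)
    for gel in replicates:
        for feature_id, abundance in gel.items():
            seen[feature_id].append(abundance)
    # Features found on only one gel ("orphans") are kept, not discarded.
    return {fid: (mean(vals), len(vals)) for fid, vals in seen.items()}


gels = [
    {"f1": 10.0, "f2": 4.0},
    {"f1": 12.0, "f2": 6.0, "f3": 1.5},  # f3 is an orphan feature
    {"f1": 11.0, "f2": 5.0},
]
composite = build_composite(gels)
```

Recording how many gels each feature appeared on is one simple way to retain orphan proteins while still flagging them as less well supported.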

By means of illustration, Fig. 1 shows the process 
whereby proteomes are generated from normal and dis- 
ease samples and how differentially expressed proteins are 
identified. The potential of this type of analysis is tremen- 
dous. For example, from a mammalian cell sample, in ex- 
cess of 2000 proteins can typically be resolved within the 
proteome. The quality of this is shown in Fig. 2, which
shows representative proteomes from three diverse biological
sources: human serum, the pathogenic fungus
Candida albicans and the human hepatoma cell line 
Huh7. 

Use of proteomics to identify 
disease specific proteins 

In most cases, the drug discovery process is initiated by 
the identification of a novel candidate target - almost al- 
ways a protein - that is believed to be instrumental in the 
disease process. To date, there is a variety of means 
whereby drug targets have been forthcoming. These in- 
clude molecular, cellular and genomic approaches, mostly 
centred upon DNA and mRNA analysis. The gene in ques- 
tion is isolated, and expression and characterization of its 
coded protein product - i.e. the drug target - is invariably 
a secondary event. 

With the proteomic approach, the starting point is at the 
other end of the 'telescope'. Here there is direct and immediate
comparison of the proteomes from paired normal
and disease materials. Examples of these pairs are: (1) pu- 
rified epithelial cell populations derived from human 
breast tumours, matched to purified normal populations of 
human breast epithelial cells, and (2) the invading pathogenic
hyphal form of C. albicans, matched to the non-invading
yeast form of C. albicans. When the proteome
images from each pair are aligned, the Proteograph™ soft- 
ware is able to rapidly identify those proteins (each refer- 
enced as having a unique molecular cluster index, or MCI) 
that are either unique, or those that are differentially ex- 
pressed. Thus, the Proteograph output from this analysis is 
both qualitative and quantitative. 

Proteograph analysis for a particular study can also be 
undertaken on any number of samples. For example, one 
might compare anything from a few to several hundred 
preparations or samples, each from a normal and disease 
counterpart, and have these analysed in a single 
Proteograph study. In this way, it is possible to assign 
strong statistical confidence to the data and in some in- 
stances to identify specific subpopulations within the input 
biological sources. This feature will become increasingly 
significant in the near future, and there is a clear synergy 
here whereby proteomics can work closely with pharma- 
cogenomic approaches to stratify patient populations and 
achieve effective targeted care for the patient. Whatever 
the source of the materials, the net output of Proteograph 
analysis is immediate identification of disease specific pro- 
teins. This is shown in Fig. 3, which shows the results of 
a proteograph obtained by comparing untreated human 
hepatoma cells with cells following exposure to a clinical 



[Figure 2: 2D gel images, panels (a)-(c); axis label pI]

Figure 2. Representative proteomes obtained from (a) human serum, (b) the pathogenic fungus Candida albicans
and (c) the human hepatoma cell line Huh7.








[Figure 3 legend. Foreground: Huh7 cells treated with 5-FU. Background: Huh7 cells untreated. Bars show features upregulated or downregulated in treated cells with respect to untreated cells.]

Figure 3. Table of differential protein expression
profiles, referred to as a Rosetta Proteograph™,
between Huh7 cells with and without the cytotoxic
agent 5-FU. Bars are quantized and do not represent
exact fold change values.



cytotoxic agent. In this instance, only the top 20 differen- 
tially expressed MCIs are shown, but the readout would 
normally extend to a defined cut-off value, typically a two- 
fold or greater difference in expression levels, determined 
by the user. 
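
A user-defined cut-off of this kind can be sketched as follows. This is an illustration only, not the Proteograph algorithm: the function name `differential_features`, the signed fold-change convention and the example abundances are assumptions made for the sketch.

```python
def differential_features(normal, disease, cutoff=2.0):
    """Return features unique to one sample or changed by >= cutoff-fold.

    normal, disease: dicts mapping an MCI-like feature ID to abundance.
    The result maps feature ID -> signed fold change (positive = higher
    in disease, negative = lower), or the string "unique" when the
    feature is absent from one of the two composites.
    """
    result = {}
    for mci in sorted(set(normal) | set(disease)):
        n, d = normal.get(mci, 0.0), disease.get(mci, 0.0)
        if n == 0.0 or d == 0.0:
            if n != d:
                result[mci] = "unique"
            continue
        fold = d / n if d >= n else -(n / d)
        if abs(fold) >= cutoff:
            result[mci] = fold
    return result


# Hypothetical abundances keyed by made-up feature IDs.
normal = {1055: 100.0, 852: 40.0, 900: 10.0, 438: 7.0}
disease = {1055: 20.0, 852: 44.0, 900: 35.0, 147: 3.0}
hits = differential_features(normal, disease)
```

Here feature 852 changes only 1.1-fold and is dropped, while features present in only one of the two samples are reported as unique rather than given an undefined ratio.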

In a typical analysis involving disease and normal mam- 
malian material, in which each proteome would have 
~2000 protein features each assigned an MCI, the proteograph
might identify somewhere in the region of 50-300
MCIs that are unique or differentially expressed. To capitalize
rapidly on these data, OGS applies a high-throughput
mass spectrometry facility, coupled to advanced databases,
to annotate these MCIs as individual proteins. As
these are all disease specific proteins, each could represent 
a novel target and/or a novel disease marker. The process 
becomes even more powerful when a panel of features, 
rather than individual features, are assigned. The relevance 
of this is apparent when one considers that most diseases, 
if not all, are multifactorial in nature and arise from poly- 
genic changes. Rather than analysing events in isolation, 
the ability to examine hundreds or thousands of events 
simultaneously, as shown by proteomics, can offer real 
advantages. 

Identification and assignment of candidate targets 
The rapid identification and assignment of candidate tar- 
gets and markers represents a huge challenge, but this has 
been greatly facilitated by combining the recent advances 
made in proteomics and analytical mass spectrometry (Ref. 9).
Using automated procedures it is now possible to annotate 
proteins present in femtomole quantities, which would de- 
pict the low abundance class of proteins. The process of 
annotation is similarly aided by the quality and richness of 
the sequence specific databases that are currently avail- 
able, both in the public domain and in the private sector 
(e.g. those supplied by Incyte Pharmaceuticals). In this re- 
spect, the advances in proteomics have benefited consider- 
ably from the breakthroughs achieved with genomics. 

From an application perspective, cancer studies provide a 
good opportunity whereby proteomics can be instrumental 
in identifying disease specific proteins, because it is often 
feasible to obtain normal and diseased tissue from the same 
patient. For example, proteomic studies have been reported
on neuroblastomas (Ref. 10), human breast proteins from
normal and tumour sources (Refs 11-13), lung tumours (Ref. 14), colon
tumours (Ref. 15) and bladder tumours (Ref. 16). There are also proteomic
studies reported within the cardiovascular therapeutic area,
in which disease or response proteins are identified (Refs 17,18).

Genomic microarray analysis can similarly identify 
unique species or clusters of mRNAs that are disease spe- 
cific. However, in some instances, there is a clear lack of 
correlation between the levels of a specific mRNA and its 
corresponding protein (Ref. 19, Gygi, S.P. et al., submitted).
This has now been noted by many investigators and
reaffirms that post-transcriptional events, including protein 
stability, protein modification (such as phosphorylation, 
glycosylation, acylation and methylation) and cell localiz- 
ation, can constitute major regulatory steps. Proteomic 
analysis captures all of these steps and can therefore pro- 
vide unique and valuable information independent from, 
or complementary to, genomic data. 






Proteomics for target validation and signal transduction studies

The identification of disease specific proteins alone is insufficient
to begin a drug screening process. It is critical to
assign function and validation to these proteins by con- 
firming they are indeed pivotal in the disease process. 
These studies need to encompass both gain- and loss-of-function
analyses. This would determine whether the activity
of a candidate target (an enzyme, for example), eliminated 
by molecular/cellular techniques, could reverse a disease 
phenotype. If this happened, then the investigator would 
have increased confidence that a small-molecule inhibitor 
against the target would also have a similar effect. The 
proposal of candidate drug targets is often not a difficult 
process, but validating them is another matter. Validation 
represents a major bottleneck where the wrong decision 
can have serious consequences (Ref. 20).

Proteomics can be used to evaluate the role of a chosen 
target protein in signal transduction cascades directly relevant
to the disease. In this manner, valuable information
is forthcoming on the signalling pathways that are per- 
turbed by a target protein and how they might be cor- 
rected by appropriate therapeutics. Techniques that are 
well established in one-dimensional protein studies to in- 
vestigate signalling pathways, such as western blotting 
and immunoprecipitation, are highly suited to proteomic 
applications. For example, the proteomes obtained can be 
blotted onto membranes and probed with antibodies 
against the target protein or related signalling mol- 
ecules^^"^^. Because proteomics can resolve >2000 pro- 
teins on a single gel, it is possible to derive important 
information on specific isoforms (such as glycosylated or 
phosphorylated variants) of signalling molecules. This will 
result in characterization of how they are altered in the 
disease process. Western immunoblotting techniques 
using high-affinity antibodies will typically identify proteins
present at ~10 copies per cell (~1.7 fmol); this is in
contrast to the best fluorescent dyes currently available 
that are limited to imaging proteins at 1000 or more 
copies per cell. The level of sensitivity derived by these 
applications will greatly facilitate interpretation of com- 
plex signalling pathways and contribute significantly to 
validation of the target under study. 
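
The relationship between copies per cell and molar amount quoted above can be checked with a short calculation. The text does not state how many cells were loaded; the figure of roughly 10^8 cells below is simply what the quoted pair of numbers implies, and should be read as an inference rather than a reported value. The function names are illustrative.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_to_fmol(copies_per_cell, n_cells):
    """Total amount of one protein, in femtomoles, in a lysate of n_cells."""
    return copies_per_cell * n_cells / AVOGADRO * 1e15


def cells_for_fmol(fmol, copies_per_cell):
    """Number of cells needed to supply `fmol` of a protein."""
    return fmol * 1e-15 * AVOGADRO / copies_per_cell


# Cell number implied by "10 copies per cell (1.7 fmol)": about 1e8 cells.
implied_cells = cells_for_fmol(1.7, 10)

# Amounts at an assumed load of 1e8 cells.
low_abundance = copies_to_fmol(10, 1e8)      # antibody-detectable level
dye_threshold = copies_to_fmol(1000, 1e8)    # fluorescent dye level
```

At the same assumed load, a protein at the 1000 copies per cell limit quoted for fluorescent dyes would correspond to roughly 100 times the amount detected by immunoblotting.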

Immunoprecipitation studies 

Similarly, immunoprecipitation studies are another useful 
way to exploit the resolving power of proteomics^'*^^^. In 
this instance, very large quantities of protein (e.g. several 
milligrams) can be subjected to incubation with antibodies 
against chosen signalling molecules. This allows high-affinity
capture of these proteins, which can subsequently be
eluted and electrophoresed on a 2D gel to provide a high- 
resolution proteome of a specific subset of proteins. 
Detection by blot analysis allows the identification of ex- 
tremely small amounts of defined signalling molecules. 
Again, the different isoforms of even very low abundance 
proteins can be seen, and, very importantly, the technique
allows the investigator to identify multiprotein complexes 
or other proteins that co-precipitate with the target protein. 
These coassociating proteins frequently represent sig- 
nalling partners for the target protein, and their identifi- 
cation by mass spectrometry can lead to invaluable infor- 
mation on the signalling processes involved. 

The depth of signal transduction analysis offered by 
proteomics, and the utility for target validation studies, 
can be extended even further by applying cell fraction- 
ation studies^^^^. By purifying subcellular fractions, such 
as membrane, nuclear, organelle and cytosolic, it is possi- 
ble to assign a localization to proteins of interest and to 
follow their trafficking in a cell. Enrichment of these frac- 
tions will also allow much higher representation of low 
abundance proteins on the proteome. Their detection by 
fluorescent dyes or immunoblot techniques will lead to
the identification of proteins in the range of 1-10 copies 
per cell, putting the sensitivity on a par with genomic 
approaches. 

These signal transduction analyses can be of additional 
value in experiments where inhibitors derived from a 
screening programme against the target are being evalu- 
ated for their potency and selectivity. The inhibitors can 
encompass small molecules, antisense nucleic acid constructs,
dominant-negative proteins, or neutralizing antibodies
microinjected into cells. In each case, proteome
analysis can provide unique data in support of validation 
studies for a chosen candidate drug target. 

Proteomics and drug mode-of-action studies

Once a validated target is committed to a screening regimen
to identify and advance a lead molecule, it is important
to confirm that the efficacy of the inhibitor is through
the expected mechanism. Such mode-of-action studies are 
usually tackled by various cell biological and biochemical 
methods. Proteomics can also be usefully applied to these 
studies and this is illustrated below by describing data obtained
with OGT719. This is a novel galactosyl derivative of
the cytotoxic agent 5-fluorouracil (5-FU), which is currently 
being developed by OGS for the treatment of hepatocel- 
lular carcinoma and colorectal metastases localized 
in the liver. The premise underpinning the design and ra- 
tionale of OGT719 was to derive a 5-FU prodrug capable 







Figure 4. Features that are specifically up- or downregulated in Huh7 cells by either 5-fluorouracil (5-FU) or
OGT719: (a) elongation factor 1α2, (b) novel (three peptides by MS-MS) and (c) α-subunit of prolyl-4-hydroxylase.
Arrows indicate up- or downregulated.



of targeting, and being retained in, cells bearing the asialo- 
glycoprotein receptor (ASGP-r), including hepatocytes^^, 
hepatoma Huh7 cells^ and some colorectal tumour cells^'. 
The growth of the human hepatoma cell line Huh7 is in- 
hibited by 5-FU or by OGT719. If the inhibition by 
OGT719 were the result of uptake and conversion to 5-FU 
as the active component, then it would be expected that 
Huh7 cells would show similar proteome profiles following
exposure to either drug.

To examine these possibilities, we conducted an experiment
taking samples of Huh7 cells that had been treated
with IC50 doses of either OGT719 or 5-FU. Total cell lysates
were prepared and taken through 2D electrophoresis, 
fluorescence staining, digital imaging and Proteograph 
analysis. To facilitate the interpretation of the data across 
all of the 2291 features seen on the proteomes, drug- 
induced protein changes of fivefold or greater, identified 
by the Proteograph, were analysed further. Interestingly, 
from this analysis 19 identical proteins were changed five- 
fold or more by both drugs, strongly suggesting similarities 
in the mode of action for these two compounds. 
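
The comparison described above amounts to intersecting two sets of strongly changed features. The sketch below is a hedged illustration with made-up feature IDs and abundances; the helper `changed` is hypothetical, and for brevity it ignores features missing from either proteome and does not check that both drugs move a feature in the same direction.

```python
def changed(control, treated, fold=5.0):
    """IDs of features whose abundance rose or fell by >= fold."""
    hits = set()
    for fid, base in control.items():
        level = treated.get(fid)
        if level is None or base <= 0 or level <= 0:
            continue  # sketch: skip features absent from either proteome
        if max(level / base, base / level) >= fold:
            hits.add(fid)
    return hits


untreated = {"a": 10.0, "b": 10.0, "c": 10.0, "d": 10.0}
with_5fu = {"a": 60.0, "b": 1.5, "c": 12.0, "d": 55.0}
with_ogt719 = {"a": 52.0, "b": 2.0, "c": 70.0, "d": 11.0}

# Features changed fivefold or more by both drugs.
shared = changed(untreated, with_5fu) & changed(untreated, with_ogt719)
```

A large overlap, as with the 19 proteins reported in the text, supports a common mode of action; an empty intersection points the other way.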

Thus, from very complex data involving >2000 protein 
features, using proteomics it is possible to analyse quanti- 
tatively and qualitatively each protein during its exposure 
to drugs. The biologist is now able to focus a series of further
studies specifically on an enriched subset of proteins.
Figure 4 shows highlighted examples of the selected areas
of the proteome where some of these identified proteins in 
the above study are altered in response to either or both 
drugs. 

Several of the proteins identified above as being modulated
similarly by 5-FU or OGT719 in Huh7 cells were subjected
to tandem mass-spectrometric analysis for annotation.
Some of these, such as the nuclear ribosomal
RNA-binding protein^^, can be placed into pyrimidine 
pathways or related cell cycle/growth biochemical pathways
in which 5-FU is known to act.

To attribute further significance to the proteome mode- 
of-action studies with OGT719, another cell line, the rat 
sarcoma HSN, was used. Growth of these cells is inhibited 
by 5-FU, but they are completely refractory to OGT719; 
notably they lack the ASGP-r, which might explain this 
finding (unpublished). For our proteome studies, HSN 
cells were treated with 5-FU or OGT719 over a time course 
of one, two and four days. At each time point, cells were 
harvested and processed to derive proteomes and 
Proteographs. As before, we purposely focused on those 
proteins that increased or decreased by fivefold or more. 
In this instance, there were no proteins co-modulated by 
the two drugs. This is perhaps to be expected, given that 
the HSN cells are killed by 5-FU and yet are refractory to 
OGT719.






Clear potential 

The above is just an example of how proteomics can be 
used to address the mode of action of anticancer drugs. 
The potential of this approach is clear, and one can envis- 
age situations where it will be profitable to compare the 
proteomes of cells in which the drug target has been elimi- 
nated by molecular knockout techniques, or with small- 
molecule inhibitors believed to act specifically on the same 
target. In addition to using proteomics to examine the ac- 
tion of drugs, it is also possible to use this approach to 
gauge the extent of nonspecific effects that might eventu- 
ally lead to toxicity. For instance, in the example used 
above with HSN cells treated with OGT719, although cell 
growth was not affected, the levels of several specific pro- 
teins were changed. Further investigation of these proteins 
and the signalling pathways in which they are involved 
could be illuminating in predicting the likelihood or other- 
wise of long-term toxicity. 

Use of proteomics in formal drug 
toxicology studies 

A drug discovery programme at the stage where leads 
have been identified and mode-of-action studies are ad- 
vanced, will proceed to investigate the pharmacokinetic 
and toxicology profile of those agents. These two param- 
eters are of major importance in the drug discovery 
process, and many agents that have looked highly promis- 
ing from in vitro studies have subsequently failed because 
of insurmountable pharmacokinetic and/or toxicity prob- 
lems in vivo. Whereas the pharmacokinetic properties of a 
molecule can now be characterized quickly and accu- 
rately, toxicity studies are typically much longer and more 
demanding in their interpretation. 

The ability to achieve fast and accurate predictions of 
toxicity within an in vivo setting would represent a big 
step forward in accelerating any drug discovery pro- 
gramme. Toxicity from a drug can be manifested in any 
organ. However, because the liver and kidney are the 
major sites in the body responsible for metabolism and 
elimination of most drugs, it is informative to examine 
these particular organs in detail to provide early indi- 
cations about events that might result in toxicity. 

The basis for most xenobiotic metabolizing activity is to 
increase the hydrophilicity of the compound and so facilitate
its removal from the body. Most drugs are metabolized
in the liver via the cytochrome P450 family of enzymes,
which are known to comprise a total of ~200
different members-^^*^, encompassing a wide array of 
overlapping specificities for different substrates. In addition
to clearance, they also play a major role in metabolism
that can lead to the production and removal of toxic
species, and in some instances it is possible to correlate
the ability or failure to remove such a toxin with a specific 
P450 or subgroup. 

Unique P450 profiles 

Each individual person will have a slightly different P450 
profile, largely from polymorphisms and changes in ex- 
pression levels, although other genetic and environmental 
factors aside from P450 also need to be taken into consideration.
A significant amount of research is currently
being directed towards this field - known as pharmacogenomics -
with the aim of predicting how a patient will respond
to a drug, as determined by their genetic make-up (Refs 35-37).
The marked variation of individuals in their ability
to clear a compound can be one of the key factors in de- 
ciding the overall pharmacokinetic profile of a drug. Not 
only will this have a bearing on the likelihood of a patient 
responding to a treatment, but it will also be a factor in 
determining the possibility of their experiencing an ad- 
verse effect. 

Many pharmaceutical companies are already employing 
genomic approaches, involving P450 measurements, as a 
key step in their assessment of the toxicological profile of 
a candidate drug and therefore of its suitability, or other- 
wise, to be considered for human clinical trials. There are 
limits to this approach, however. Whereas the P450 mRNA 
profiling can predict with some accuracy the likely meta- 
bolic fate of a drug, it will not provide information on 
whether the metabolites would subsequently lead to tox- 
icity. Besides the patient-to-patient differences in steady- 
state levels of the P450s, there are also characteristic induc- 
tion responses of these enzymes to some drugs. Moreover, 
as there can be some doubt over the correlation of mRNA 
levels and the corresponding protein levels, there is scope 
for misinterpretation of the results and hence real advan- 
tages to be gained from a proteome approach. In both in- 
stances, the ability to examine entire proteome profiles, in- 
cluding the P450 proteins, will be a significant advantage 
in understanding and predicting the metabolism and 
toxicological outcome of drugs. 

In addition to direct organ and tissue studies, the serum, 
which collects the majority of toxicity markers released 
from susceptible organs and tissues throughout the entire 
body, can be utilized. Serum is rich in nuclease activity 
and, as pharmacogenomics is not suited to deal with these 
samples, valuable markers of toxicity could go undetected. 
However, by using proteomics for these types of analyses, 
serum markers (and clusters thereof) are now accessible
for evaluation as indicators of toxicity. 







Pharmacoproteomics 

Proteomics can thus be used to add a new sphere of 
analysis to the study of toxicity at the protein level, and in 
the era of '-omics' there is a case to be made to adopt the
term 'Pharmacoproteomics™'. Animals can be dosed with
increasing levels of an experimental drug over time, and 
serum samples can be drawn for consecutive proteome 
analyses. Using this procedure, it should be possible to 
identify individual markers, or clusters thereof, that are 
dose related and correlate with the emergence and severity 
of toxicity. Markers might appear in the serum at a defined
drug dose and time that are predictive of early toxicity 
within certain organs and if allowed to continue will have 
damaging consequences. These serum markers could sub- 
sequently be used to predict the response of each individ- 
ual and allow tailoring of therapy whereby optimal effi- 
cacy is achieved without adverse side effects being 
apparent. This application can obviously extend to tracking
toxicity of drugs in clinical trials where serum can be
readily drawn and analysed. Surrogate markers for drug efficacy
could also be detected by this procedure and could
facilitate the challenge of identifying patient classes who 
will respond favourably to a drug and at what dosage. 
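
One simple way to flag dose-related serum markers is to correlate each marker's abundance with the administered dose. The sketch below uses a plain Pearson correlation with an arbitrary threshold of 0.9; the marker names, dose levels and abundances are invented for illustration, and the article does not specify which statistic would be used in practice.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no dose information
    return sxy / sqrt(sxx * syy)


def dose_related(doses, markers, threshold=0.9):
    """Names of markers whose levels track dose, rising or falling."""
    return sorted(
        name for name, levels in markers.items()
        if abs(pearson(doses, levels)) >= threshold
    )


doses = [0, 1, 2, 4, 8]  # hypothetical dose levels
serum = {
    "m1": [1.0, 1.5, 2.1, 4.0, 8.2],  # rises with dose
    "m2": [3.0, 3.1, 2.9, 3.0, 3.1],  # essentially flat
    "m3": [9.0, 8.0, 7.1, 5.0, 1.2],  # falls with dose
}
flagged = dose_related(doses, serum)
```

Clusters of markers could be handled the same way by scoring a combined abundance for each cluster instead of each individual feature.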

Conclusions 

By contrast to the agents administered to patients in clini- 
cal wards, the process of drug discovery is not a prescrip- 
tive series of steps. The risks are high and there are long 
timelines to be endured before it is known whether a can- 
didate drug will succeed or fail. At each step of the drug 
discovery process there is often scope for flexibility in in- 
terpretation, which over many steps is cumulative. The 
pharmaceutical companies most likely to succeed in this 
environment are those that are able to make informed 
accurate decisions within an accelerated process. 

The genomics revolution has impacted very positively 
upon these issues and now has a powerful new partner in 
proteomics. The ability to undertake global analysis of pro- 
teins from a very wide diversity of biological systems and 
to interrogate these in a high-throughput, systematic man- 
ner will add a significant new dimension to drug discov- 
ery. Each step of the process from target discovery to clini- 
cal trials is accessible to proteomics, often providing 
unique sets of data. Using the combination of genomics
and proteomics, scientists can now see every dimension of
their biological focus, from genes and mRNA to proteins and
their subcellular localization. This will greatly assist our
understanding of the fundamental mechanistic basis of
human disease and allow new, improved and speedier
drug discovery strategies to be implemented.






REFERENCES 

1 Crooke, S.T. (1998) Nat. Biotechnol. 16, 29-30
2 Dykes, C.W. (1996) Br. J. Clin. Pharmacol. 42, 685-695
3 Schena, M. et al. (1998) Trends Biotechnol. 16, 301-306
4 Ramsay, G. (1998) Nat. Biotechnol. 16, 40-44
5 Anderson, N.L. and Anderson, N.G. (1998) Electrophoresis 19, 1853-1861
6 James, P. (1997) Biochem. Biophys. Res. Commun. 231, 1-6
7 Wilkins, M.R. et al. (1996) Biotechnol. Genet. Eng. Rev. 13, 19-50
8 Parekh, R.B. and Rohlff, C. (1997) Curr. Opin. Biotechnol. 8, 718-723
9 Figeys, D. et al. (1998) Electrophoresis 19, 1811-1818
10 Wimmer, K. et al. (1996) Electrophoresis 17, 1741-1751
11 Giometti, C.S., Williams, K. and Tollaksen, S.L. (1997) Electrophoresis 18, 573-581
12 Williams, K. et al. (1998) Electrophoresis 19, 333-343
13 Rasmussen, R.K. et al. (1998) Electrophoresis 19, 818-825
14 Hirano, T. et al. (1995) Br. J. Cancer 72, 840-848
15 Ji, H. et al. (1997) Electrophoresis 18, 605-613
16 Ostergaard, M. et al. (1997) Cancer Res. 57, 4111-4117
17 Patel, V.B. et al. (1997) Electrophoresis 18, 2788-2794
18 Arnott, D. et al. (1998) Anal. Biochem. 258, 1-18
19 Anderson, L. and Seilhamer, J. (1997) Electrophoresis 18, 533-537
20 Rastan, S. and Beeley, L.J. (1997) Curr. Opin. Genet. Dev. 7, 777-783
21 Gravel, P. et al. (1995) Electrophoresis 16, 1152-1159
22 Qian, Y. et al. (1997) Clin. Chem. 43, 352-359
23 Sanchez, J.C. et al. (1997) Electrophoresis 18, 638-641
24 Watts, A.D. et al. (1997) Electrophoresis 18, 1086-1091
25 Asker, N. et al. (1995) Biochem. J. 308, 873-880
26 Ramsby, M.L., Makowski, G.S. and Khairallah, E.A. (1994) Electrophoresis 15, 265-277
27 Huber, L.A. (1995) FEBS Lett. 369, 122-125
28 Corthals, G.L. et al. (1997) Electrophoresis 18, 317-323
29 Hubbard, A.L., Wall, D.A. and Ma, A. (1983) J. Cell Biol. 96, 217-229
30 Zeng, F.Y., Oka, J.A. and Weigel, P.H. (1996) Biochem. Biophys. Res. Commun. 218, 325-330
31 Mu, J-Z. et al. (1994) Biochim. Biophys. Acta 1222, 483-491
32 Ghoshal, K. and Jacob, S.T. (1997) Biochem. Pharmacol. 53, 1569-1575
33 Guengerich, F.P. and Parikh, A. (1997) Curr. Opin. Biotechnol. 8, 623-628
34 Rendic, S. and Di Carlo, F.J. (1997) Drug Metab. Rev. 29, 413-580
35 Vermes, A., Guchelaar, H.J. and Koopmans, R.P. (1997) Cancer Treat. Rev. 25, 321-339
36 Housman, D. and Ledley, F.D. (1998) Nat. Biotechnol. 16, 492-493
37 Persidis, A. (1998) Nat. Biotechnol. 16, 209-210




DDT Vol. 4, No. 2 February 1999



Reference 18 of 20
with Response dated 05/04/04 
In USSN: 09/857,826 




If these minor cell proteins differ among cells to the same extent as the
more abundant proteins, as is commonly assumed, only a small number of
protein differences (perhaps several hundred) suffice to create very large
differences in cell morphology and behavior.

A Cell Can Change the Expression of Its Genes
in Response to External Signals

Most of the specialized cells in a multicellular organism are capable of altering
their patterns of gene expression in response to extracellular cues. If a liver cell
is exposed to a glucocorticoid hormone, for example, the production of several
specific proteins is dramatically increased. Glucocorticoids are released during
periods of starvation or intense exercise and signal the liver to increase the
production of glucose from amino acids and other small molecules; the set of
proteins whose production is induced includes enzymes such as tyrosine
aminotransferase, which helps to convert tyrosine to glucose. When the hormone
is no longer present, the production of these proteins drops to its normal level.

Other cell types respond to glucocorticoids in different ways. In fat cells, for
example, the production of tyrosine aminotransferase is reduced, while some
other cell types do not respond to glucocorticoids at all. These examples illustrate
a general feature of cell specialization: different cell types often respond in
different ways to the same extracellular signal. Underlying this specialization are
features that do not change, which give each cell type its permanently distinctive
character. These features reflect the persistent expression of different sets of
genes.

Gene Expression Can Be Regulated at Many of the Steps
in the Pathway from DNA to RNA to Protein

If differences between the various cell types of an organism depend on the
particular genes that the cells express, at what level is the control of gene
expression exercised? There are many steps in the pathway leading from DNA to
protein, and all of them can in principle be regulated. Thus a cell can control the
proteins it makes by (1) controlling when and how often a given gene is
transcribed (transcriptional control), (2) controlling how the primary RNA
transcript is spliced or otherwise processed (RNA processing control),
(3) selecting which completed mRNAs in the cell nucleus are exported to the
cytoplasm (RNA transport control), (4) selecting which mRNAs in the cytoplasm
are translated by ribosomes (translational control), (5) selectively destabilizing
certain mRNA molecules in the cytoplasm (mRNA degradation control), or
(6) selectively activating, inactivating, or compartmentalizing specific protein
molecules after they have been made (protein activity control) (Figure 9-2).

For most genes transcriptional controls are paramount. This makes sense
because, of all the possible control points illustrated in Figure 9-2, only
transcriptional control ensures that no superfluous intermediates are
synthesized. In the



[Figure 9-2 diagram. Labels in the NUCLEUS: primary RNA transcript, mRNA;
1 transcriptional control; RNA processing control. Labels in the CYTOSOL:
3 RNA transport control; mRNA; translation control; inactive mRNA;
5 mRNA degradation control; protein activity control.]

Figure 9-2 Six steps at which eucaryote gene expression can be
controlled. Only controls that operate at steps 1 through 5 are
discussed in this chapter. The regulation of protein activity (step 6)
is discussed in Chapter 5; this includes reversible activation or
inactivation by protein phosphorylation as well as irreversible
inactivation by proteolytic degradation.



An Overview of Gene Control



[Figure 9-71 diagram. Labels: methylation of most CG sequences in germ
line; many millions of years of evolution; RNA; VERTEBRATE DNA; CG island;
scale bar, 1000 nucleotide pairs.]

Figure 9-71 A mechanism to explain both the marked deficiency of CG
sequences and the presence of CG islands in vertebrate genomes. A
black line marks the location of an unmethylated CG dinucleotide in the
DNA sequence, while a red line marks the location of a methylated CG
dinucleotide.
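
The "marked deficiency of CG sequences" named in this caption is commonly expressed as an observed/expected CG ratio. The following is a minimal Python sketch under that assumption: the ratio is taken as the number of CG dinucleotides times the sequence length, divided by the product of the C and G counts. The function name cpg_obs_exp and the two toy sequences are hypothetical and do not come from the source.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CG dinucleotide ratio for one DNA strand.

    Assumed definition (not from the source text):
        (number of CG dinucleotides * sequence length) / (number of C * number of G)
    """
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0  # no CG dinucleotide is possible
    n_cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return n_cg * len(seq) / (n_c * n_g)


# Hypothetical toy sequences with the same base composition:
depleted = "CAGTCAGT"   # contains C and G but no CG dinucleotide
enriched = "ACGTACGT"   # every C is followed by G

ratio_depleted = cpg_obs_exp(depleted)
ratio_enriched = cpg_obs_exp(enriched)
```

A ratio well below 1 would correspond to the CG-deficient bulk of the genome described in the caption, while a CG island would score much higher; the thresholds used in practice are not given in the source.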






The many types of cells in animals and plants are created largely through
mechanisms that cause different genes to be transcribed in different cells. Since
many specialized animal cells can maintain their unique character when grown in
culture, the gene regulatory mechanisms involved in creating them must be stable
once established and heritable when the cell divides, endowing the cell with a
memory of its developmental history. Procaryotes and yeasts provide unusually
accessible model systems in which to study gene regulatory mechanisms, some of
which may be relevant to the creation of specialized cell types in higher
eucaryotes. One such mechanism involves a competitive interaction between two
(or more) gene regulatory proteins, each of which inhibits the synthesis of the
other; this can create a flip-flop switch that switches a cell between two
alternative patterns of gene expression. Direct or indirect positive feedback
loops, which enable gene regulatory proteins to perpetuate their own synthesis,
provide a general mechanism for cell memory.

In eucaryotes gene transcription is generally controlled by combinations of gene
regulatory proteins. It is thought that each type of cell in a higher eucaryotic
organism contains a specific combination of gene regulatory proteins that ensures
the expression of only those genes appropriate to that type of cell. A given gene
regulatory protein may be expressed in a variety of circumstances and typically is
involved in the regulation of many genes.

In addition to diffusible gene regulatory proteins, inherited states of chromatin
condensation are also utilized by eucaryotic cells to regulate gene expression. In
vertebrates DNA methylation also plays a part, mainly as a device to reinforce
decisions about gene expression that are made initially by other mechanisms.
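
The flip-flop switch mentioned in this summary, in which each of two gene regulatory proteins inhibits the synthesis of the other, can be illustrated numerically. The Python sketch below is a toy model and is not taken from the source: the rate equations, the parameter values (alpha, hill), the step size and the function name simulate_toggle are all assumptions, chosen only to show that mutual repression can give two stable states.

```python
def simulate_toggle(x0, y0, alpha=4.0, hill=2, dt=0.01, steps=5000):
    """Integrate a two-protein mutual-repression model with Euler steps.

    Assumed toy equations (illustrative only):
        dx/dt = alpha / (1 + y**hill) - x
        dy/dt = alpha / (1 + x**hill) - y
    x and y are the levels of two regulatory proteins; each one
    represses synthesis of the other and decays at unit rate.
    """
    x, y = x0, y0
    for _ in range(steps):
        dx = alpha / (1.0 + y ** hill) - x
        dy = alpha / (1.0 + x ** hill) - y
        x, y = x + dt * dx, y + dt * dy
    return x, y


# Two starting points that differ only in which protein has a small head start.
x_wins = simulate_toggle(2.0, 1.0)   # slightly more X at the start
y_wins = simulate_toggle(1.0, 2.0)   # slightly more Y at the start
```

With these assumed parameters, a small initial excess of X ends in a state with X high and Y low, and a small initial excess of Y ends in the mirror-image state. The same circuit therefore holds either of two alternative patterns of expression, depending on which one it was pushed into.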

Posttranscriptional Controls 

Although controls on the initiation of gene transcription are the predominant
form of regulation for most genes, other controls can act later in the pathway
[...]. Although these posttranscriptional controls, which operate after RNA
polymerase has bound to the gene's promoter and begun RNA synthesis, are less
common than transcriptional control, for many genes they are crucial. It seems
that every [...] under some circumstances for some genes.

We consider the varieties of posttranscriptional regulation in temporal order,
according to the sequence of events that might be experienced by an RNA
molecule after its transcription has begun (Figure 9-72).



[Figure 9-72 diagram. Steps shown, in order: START RNA TRANSCRIPTION;
POSSIBLE ATTENUATION; SPLICING AND 3'-END CLEAVAGE; NUCLEAR EXPORT;
SPATIAL LOCALIZATION IN CYTOPLASM; POSSIBLE RNA EDITING; START
TRANSLATION; POSSIBLE TRANSLATIONAL RECODING; POSSIBLE RNA
STABILIZATION; CONTINUED PROTEIN SYNTHESIS. Alternative outcomes shown:
RNA transcript aborts; nonfunctional mRNA sequences; retention in nucleus;
translation blocked; RNA degraded.]

Figure 9-72 Possible posttranscriptional controls on gene
expression. Only a few of these controls are likely to be used for any
one gene.






NCBI Sequence Viewe wysiwyg://6/http://www.ncbi.nlm.nih.gov:...ntrez/viewer.fcgi?db=protein&val=22714'



Reference 19 of 20
with Response dated 05/04/04
In USSN: 09/857,826



1: AAC52025. clone 22 [Homo sa...[gi:2271473]



LOCUS       AAC52025      248 aa      linear   PRI 17-FEB-1998
DEFINITION  clone 22 [Homo sapiens].
ACCESSION   AAC52025
VERSION     AAC52025.1  GI:2271473
DBSOURCE    locus AF009426 accession AF009426.1
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (residues 1 to 248)
  AUTHORS   Yoshikawa,T., Sanders,A.R., Esterling,L.E., Overhauser,J.,
            Games,J.A., Lennon,G., Grewal,R. and Detera-Wadleigh,S.D.
  TITLE     Isolation of chromosome 18-specific brain transcripts as positional
            candidates for bipolar disorder
  JOURNAL   Am. J. Med. Genet. 74 (2), 140-149 (1997)
  MEDLINE   97275951
   PUBMED   9129712
REFERENCE   2  (residues 1 to 248)
  AUTHORS   Yoshikawa,T., Sanders,A.R., Esterling,L.E. and Detera-Wadleigh,S.D.
  TITLE     Multiple transcriptional variants and RNA editing in C18orf1, a
            novel gene with LDLRA and transmembrane domains on 18p11.2
  JOURNAL   Genomics 47 (2), 246-257 (1998)
  MEDLINE   98140124
   PUBMED   9479497
REFERENCE   3  (residues 1 to 248)
  AUTHORS   Yoshikawa,T. and Detera-Wadleigh,S.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (20-JUN-1997) Clinical Neurogenetics Branch, National
            Institute of Mental Health, Bethesda, MD 20892, USA
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..248
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="18"
                     /map="18p11.2"
     Protein         1..248
                     /product="clone 22"
     CDS             1..248
                     /coded_by="AF009426.1:243..989"
                     /note="alternatively spliced; beta-1 form; possible
                     membrane-spanning protein"
ORIGIN
        1 maaelefaqi iiiwwtvm vwivcllnh ykvstrsfin rpnqsrrred glpqegclwp
       61 sdsaaprlga seimhaprsr drftapsfiq rdrfsrfqpt ypyvqheidl pptislsdge
      121 epppyqgpct lqlrdpeqqm elnresvrap pnrtifdsdl idiamysggp cppssnsgis
      181 astcssngrm egppptysev mghhpgasfl hhqrsnahrg srlqfqqnna estivpikgk
      241 drkpgnlv
//
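
The record's stated length (248 aa) can be cross-checked against its /coded_by qualifier, "AF009426.1:243..989". The Python sketch below does that arithmetic. It assumes the nucleotide span is a single contiguous interval that includes the stop codon; the function name cds_protein_length is hypothetical and is not part of any NCBI tool.

```python
import re


def cds_protein_length(coded_by: str) -> int:
    """Amino-acid count implied by a simple 'ACCESSION:start..end' CDS span.

    Assumptions (labelled, not taken from the record itself): the span is
    one contiguous interval with no join() or complement(), and its last
    codon is the stop codon, which contributes no residue.
    """
    match = re.fullmatch(r"[^:]+:(\d+)\.\.(\d+)", coded_by)
    if match is None:
        raise ValueError(f"unsupported location string: {coded_by!r}")
    start, end = int(match.group(1)), int(match.group(2))
    nucleotides = end - start + 1          # coordinates are inclusive
    if nucleotides % 3 != 0:
        raise ValueError("span is not a whole number of codons")
    return nucleotides // 3 - 1            # drop the stop codon


implied_length = cds_protein_length("AF009426.1:243..989")
```

Here 989 - 243 + 1 = 747 nucleotides, or 249 codons, which leaves 248 residues once the stop codon is set aside. That agrees with the "248 aa" on the LOCUS line.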




3/26/2004 3:24 PM 



Reference 20 of 20
with Response dated 05/04/04
In USSN: 09/857,826



Confidential -- Property of Incyte Corporation LifeSeq Gold 5.1 Nov 2002 

Program: blastp 
Sequence ID(s) : 

1871288CD1 (LGflJAN2002p) vs. genpept138

NCBI-BLASTP 2.2.3 [May-13-2002 ] 

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search 
programs", Nucleic Acids Res. 25:3389-3402. 

Query= 1871288CD1 

(252 letters) 

Database: genpept138

1,601,536 sequences; 494,245,048 total letters 

Searching done 

                                                      Score     E
Sequences producing significant alignments:           (bits)  Value

g2271473 clone 22 [Homo sapiens]                        327   1e-88



>g2271473 clone 22 [Homo sapiens] 

Length =248 

Score = 327 bits (838), Expect = 1e-88

Identities = 169/250 (67%), Positives = 190/250 (75%), Gaps = 7/250 (2%) 

Query: 2 AELEFVQIIIIVWMMVMWVITCLLSHYKLSARSFISRHSQGRRREDALSSEGCLWPSE 61 

AELEF QIIIIVW+ VMVWI CLL+HYK+S RSFI+R +Q RRRED L EGCLWPS+ 
Sbjct: 3 AELEFAQIIIIWWTVMWVIVCLLNHYKVSTRSFINRPNQSRRREDGLPQEGCLWPSD 62 

Query: 62 STVSGNGIPEPQVYAPPRPTDRLAVPPFAQRERFHRFQPTYPYLQHEIDLPPTISLSDGE 121 

S G E + PR DR P F QR+RF RFQPTYPY+QHEIDLPPTISLSDGE 

Sbjct: 63 SAAPRLGASE--IMHAPRSRDRFTAPSFIQRDRFSRFQPTYPYVQHEIDLPPTISLSDGE 120

Query: 122 EPPPYQGPCTLQLRDPEQQLELNRESVRAPPNRTIFDSDLMDSARL-GGPCPPSSNSGIS 180 

EPPPYQGPCTLQLRDPEQQ+ELNRESVRAPPNRTIFDSDL+D A GGPCPPSSNSGIS 
Sbjct: 121 EPPPYQGPCTLQLRDPEQQMELNRESVRAPPNRTIFDSDLIDIAMYSGGPCPPSSNSGIS 180 

Query: 181 ATCYGSGGRMEGPPPTYSEVIGHYPGSSFQHQQSSGPPSLLEGTRLHHTHIAPLESAAIW 240 

A+ S GRMEGPPPTYSEV+GH-fPG+SF HQS + G+RL ES + 

Sbjct: 181 ASTCSSNGRMEGPPPTYSEVMGHHPGASFLHHQRS NAHRGSRLQFQQ-NNAESTIVP 236 



Query: 241 SKEKDKQKGH 250 

K KD++ G+ 
Sbjct: 237 IKGKDRKPGN 246 
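
The Expect value in this report is tied to the bit score by the standard relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The Python sketch below applies that relation to the figures printed above (327 bits, a 252-letter query, 494,245,048 database letters). It is an uncorrected estimate: it uses raw lengths, whereas BLAST works with shortened effective lengths, so the program's own value (1e-88) is expected to be somewhat smaller. The function name approx_evalue is hypothetical.

```python
def approx_evalue(bit_score: float, query_len: int, db_len: int) -> float:
    """Uncorrected E-value estimate from a bit score: E = m * n * 2**(-S').

    Uses raw query and database lengths. BLAST applies an edge-effect
    length adjustment that this sketch deliberately omits, so the result
    is an upper-side approximation of the reported value.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Values read from the report above.
estimate = approx_evalue(bit_score=327, query_len=252, db_len=494_245_048)
```

The estimate is of the same order of magnitude as the reported Expect value, which is what one would expect if the difference comes only from the omitted length adjustment.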



